Preface

Overview

What follows is a coded work through of Thomas Lumley’s “Complex Surveys: A Guide to Analysis in R” (Lumley 2011) and the methods underlying design based statistics more broadly. This write-up reflects my understanding of the material with additions made to try and clarify ideas further. These include simulations, derivations or references that I found helpful in working through Lumley’s material. Most of the datasets that I use throughout these notes are from Lumley’s website for the book.

Conducting the Sampling itself

It wasn’t very long into reading this book that I found that weighted sampling in R is, unfortunately, not well set-up for complex designs. Further details are in this stats exchange post. It would not be possible, for example, to simply use the popular tidyverse package dplyr’s function slice_sample() to draw a weighted sample from a population for example with appropriate inclusion probabilities. Instead a function from the sampling package would have to be used. I use this function below in any setting where a non-uniform sample with inclusion probabilities is needed.

Its worth further pointing out that the topic of how samples themselves are drawn is a complicated one its own right and that the functions in the sampling package each have pros and cons according to the target estimand or question. Drawing samples with replicate weights – discussed in Chapter 2 – is a similarly complex question which I haven’t yet resolved to my satisfaction.

Chapter 1: The Basics

Design vs. Model

Key to analysis of complex surveys is the idea of studying the design from which the data are constructed, rather than the data itself. In a traditional survey setting the data are assumed to be fixed and the probabilities of sampling different entities are used to derive the desired estimate. These inclusion probabilities or their inverse, “sampling weights”, are used to re-balance the data so that they more accurately reflect the target population distribution. Different sampling techniques — clustering, 2-phase, etc. — are used to either decrease the variance of the resulting estimate, the cost associated with the design or both.

Horvitz Thompson Estimation

The Horvitz Thompson Estimator (HTE) is the starting point for non-uniform random estimates. If we observe measure \(X_i\) on subject \(i\) of a population of \(N\) total subjects the HTE is formulated as follows:

\[ HTE(X) = \sum_{i=1}^{N} \frac{1}{\pi_i}X_i, \tag{1.1} \]

which is an unbiased estimator as shown in the Chapter 1 Appendix.

The variance of this estimate is then

\[ V[HTE(X)] = \sum_{i,j} \left ( \frac{X_i X_j}{\pi_{ij}} - \frac{X_i}{\pi_i} \frac{X_j}{\pi_j} \right ), \tag{1.2} \]

which follows from the Bernoulli covariance using indicator variables \(R_i=1\) if individual \(i\) is in the sample, \(R_i=0\) otherwise. A proof is provided in the Questions From Chapter 1.

Design And Misspecification Effects

(Kish 1965) defined the notion of a design effect as the ratio of a variance of an estimate in a complex sample to the variance of the same estimate in a simple random sample (SRS). The motivation for this entity being that it can guide researchers in terms of how much sample size they may need; If the sample size for a given level of precision is known for a simple random sample, the sample size for a complex design can be obtained by multiplying by the design effect.

While larger sample sizes may be necessary to maintain the same level of variance as a SRS, the more complex may still be more justified because of the lower cost associated. See (Meng 2018) for an example of where design effects are used in a modern statistical setting by comparing competing estimators.

Other preliminary items

From this point Lumley works through an introduction to the datasets used in the book and the idea that we’ll often be taking samples from datasets where we know the “true” population and computing estimates from there. This isn’t always the case and there may some subtlety worth discussing how to interpret results once we get into topics like regression, but for the most part his description makes sense.

One thing I found lacking in this introductory section is the motivation for why we might take non-uniform samples. It isn’t until Chapter 3 that Lumley discusses probability proportional to size (PPS) sampling, but this is very often the reason why a non-uniform sample is used.

If we have some measure that is right skewed in our population of interest and we’d like to estimate the mean, we could take a SRS to estimate the mean but the variance on that item would be lower than if we sampled proportional to the right skew measure itself. I’ll demonstrate with the following quick example, suppose we want to measure the income of a population. Incomes are often right skewed, but we can get a lower variance estimate if we take a weighted sample.

I generate a right skewed population and visualize the distribution.

population <- tibble(
  id = 1:1E3,
  income = 5E4 * rlnorm(1E3, meanlog = 0, sdlog = 0.5)
)
population %>%
  ggplot(aes(x = income)) +
  geom_histogram() +
  geom_vline(aes(xintercept = mean(income)), linetype = 2, color = "red") +
  ggtitle("Simulated Income Distribution") +
  xlab("Income (USD)")

Here I’ll take a uniform and weighted sample of size 50. Note that the differences in the samples are subtle. They might not look all that different on visual inspection.

uniform_sample <- population %>%
  slice_sample(n = 50) %>%
  transmute(
    income = income,
    method = "uniform",
    pi = 50 / 1E3
  )

weighted_sample <- population %>%
  mutate(
    pi = sampling::inclusionprobabilities(floor(population$income), 50),
    in_weighted_sample = sampling::UPbrewer(pi) == 1
  ) %>%
  filter(in_weighted_sample) %>%
  transmute(
    income = income,
    pi = pi,
    method = "weighted"
  )

rbind(
  uniform_sample,
  weighted_sample
) %>%
  ggplot(aes(x = income, fill = method)) +
  geom_histogram() +
  ggtitle("Sample Comparisons") +
  xlab("Income (USD)") +
  theme(legend.title = element_blank(), legend.position = "top")

Finally I’ll estimate the population mean from both samples and include the design effect calculation in the weighted sample estimate.

uniform_sample %>%
  as_survey_design() %>%
  summarize(
    mean_income = survey_mean(income)
  )
## # A tibble: 1 × 2
##   mean_income mean_income_se
##         <dbl>          <dbl>
## 1      51464.          3699.
weighted_sample %>%
  as_survey_design(probs = pi) %>%
  summarize(mean_income = survey_mean(income, deff = TRUE))
## # A tibble: 1 × 3
##   mean_income mean_income_se mean_income_deff
##         <dbl>          <dbl>            <dbl>
## 1      61475.          3412.            0.997

We see that the weighted estimate standard error is not quite half the uniform estimate. Accordingly the design effect for the weighted sample is less than 1.

Questions From Chapter 1

1.1-1.2 Don’t make sense to reproduce here.

1.3 Each visit to the front page of a newspaper’s website has (independently) a 1/1000 chance of resulting in a questionnaire on voting intentions in a forthcoming election. Assuming that everyone who is given the questionnaire responds, why are the results not a probability sample of

  • Voters?
  • Readers of the newspaper?
  • Readers of the newspaper’s online version?

Lumley lists 4 properties needed for a sample to be considered a probability sample.

  1. Every individual (unit of analysis) in the population must have a non-zero probability of ending up in the sample (\(\pi_i>0 \forall i\))

  2. \(\pi_i\) must be known for every individual who does end up in the sample.

  3. Every pair of individuals in the sample must have a non-zero probability of both ending up in the sample (\(\pi_{i,j} \forall i, j\))

  4. The probability \(\pi_{i,j}\) must be known for every pair that does end up in the sample.

1 is not guaranteed when considering voters — there are voters who don’t read the paper who have will have \(\pi_i = 0\) — or the broader heading of “readers” of the newspaper - since those who only read the physical paper will have a $_i = 0 $. For “readers of the newspaper’s online version” the sample would only be a probability sample if the time window was further specified, as there could be online readers who do not visit during the survey window, and would thus be assigned a \(\pi_i=0\).

1.4 You are conducting a survey that will estimate the proportion of women who used anti-malarial insecticide-treated bed nets every night during their last pregnancy. With a simple random sample you would need to recruit 50 women in any sub-population where you wanted a standard error of less than 5 percentage points in the estimate. You are using a sampling design that has given design effects of 2-3 for proportions in previous studies in similar areas.

  • Will you need a larger or smaller sample size than 50 for a subpopulation to get the desired precision?

Larger, a design effect \(>1\) indicates that the variance is larger in the complex design with the same sample size - consequently the sample size will need to be increased to maintain the same level of precision.

  • Approximately what sample size will you need to get the desired precision?

100 - 150. Derived from multiplying 50 by 2 and 3.

1.5 Systematic sampling involves taking a list of the population and choosing, for example, every 100th entry in the list.

  1. Which of the necessary properties of a probability sample does this have?

Items 2-4 from the list enumerated above. The only condition that is not satisfied is that not every item has a nonzero probability of being chosen.

  1. For systematic sampling with a random start, the procedure would be to choose a random starting point from 1, 2, …, 1000 and then take every 100th entry starting at the random point. Which of the necessary properties of a probability sample does this procedure have?

This satisfies all items from the above list.

  1. For systematic sampling with multiple random starts we might choose 5 random starting points in 1, 2, ….., 5000 and then take every 500th entry starting from each of the 5 random points. Which of the necessary properties of a probability sample does this procedure have?

Again, this satisfies all items from the above list.

  1. If the list were shuffled into a random order before a systematic sample was taken, which of the properties would the procedure have.

Again, all of them. The key is adding the known randomness and not excluding any items from selection.

  1. Treating a systematic sample as if it were a simple random sample often gives good results. Why would this be true?

This would be because the items are not ordered in any particular fashion prior to taking the “systematic sampling”. In this setting a systematic sample is equivalent to a simple random sample.

1.6 Why must all the sampling probabilities be non-zero to get a valid population estimate?

If any of the sampling probabilities are zero, that would introduce bias in shifting the estimate away from the portion of the population that would always be unobserved under repeated sampling.

1.7 Why must all the pairwise probabilities be non-zero to get a valid uncertainty estimate.

This is basically a second order statement equivalent to the previous. If any pair is unable to be observed together that is a form of selection bias that would shift the sample estimate away from the true population value.

1.8 A probability design assumes that people who are sampled will actually be included in the sample, rather than refusing. Look up the response rates for the most recent year of BRFSS and NHANES.

Lumley is highlighting the fact that even though we set up samples thinking that every sample will be observed that is rarely the case. Looking at just the most recent NHANES data I see response rates at ~ 78% for the unweighted, 6-year household survey.

1.9 In a telephone study using random digit dialing, telephone numbers are sampled with equal probability from a list. When a household is recruited, why is it necessary to ask how many telephones are in the household, and what should be done with this information in computing the weights.

It is necessary to ask how many telephones are in the household to downweight the a priori sampling probability accordingly because every additional telephone line increases the odds that a given house is sampled. For example in a simple population with two houses, where house one has 5 telephones and house two has 2 telephones, and we’re looking to take a \(n=1\) sample, but we don’t know the number of telephones a priori, house one has a \(\frac{5}{7}\) probability of being sampled. If that is the house that is chosen its weight needs to go from 2 to \(\frac{5}{7}\) to better reflect its sampling probability. In a real sample this would be corrected relative to all the other households number of telephones or perhaps a population average of the number of telephones.

1.10

Derive the Horvitz Thompson variance estimator for the total as follows.

  1. Write \(R_i = 1\) if individual \(i\) is in the sample, \(R_i=0\) otherwise. Show that \(V[R_i] = \pi_i(1-\pi_i)\) and that \(Cov[R_i,R_j]=\pi_{ij} - \pi_i\pi_j\).

This follows in a straightforward fashion from the assumption that \(R_i\) is distributed according to the bernoulli distribution and \(R_i \perp R_j\). This is an accurate model for sampling with replacement, or sampling from large populations with small sample sizes without replacement, but less true for small sample sizes without replacement.

  1. Show that the variance of the Horvitz Thompson estimator is:

\[ V[\hat{T}_{HT}] = \sum_{i=1}^{N}\sum_{j=1}^{N} \check{x}_i\check{x}_j(\pi_{ij} - \pi_i \pi_j) \] We have, \[ \hat{T}_{HT} := \sum_{i=1}^{N} \frac{X_i I(X_i \in {S})}{\pi_i} \\ V[\hat{T}_{HT}] = V\left[\sum_{i=1}^{N}\frac{X_i I(X_i \in {S})}{\pi_i} \right] \\ =\sum_{i=1}^{N}\sum_{j=1}^{N} Cov\left[\frac{X_i I(i \in {S})}{\pi_i}, \frac{X_j I(j \in {S})}{\pi_j}\right]\\ = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j}Cov(I(i \in {S}),I(j \in {S})) \\ = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} (\pi_{ij} - \pi_i \pi_j) \]

which is equivalent to the above, where \(\check{x_i} = \frac{X_i}{\pi_i}\).

  1. Show that an unbiased estimator of the variance is \[ \hat{V}[\hat{T}_{HT}] = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{R_i R_j}{\pi_{ij}}\check{x_i}\check{x_j}(\pi_{ij} - \pi_i \pi_j) \]

To show the expression above is unbiased for \(\hat{V}[\hat{T}_{HT}]\) we must show that \(E\left [\hat{V}[\hat{T}_{HT}] \right] = V[\hat{T}_{HT}]\) \[ E \left [\hat{V}[\hat{T}_{HT} ] \right] = E \left[\sum_{i=1}^{N}\sum_{j=1}^{N} \frac{R_iR_j}{\pi_{ij}} \check{x}_i \check{x}_j(\pi_{ij} - \pi_i\pi_j) \right] \\ = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{E[R_iR_j]}{\pi_{ij}} \check{x}_i \check{x}_j (\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\pi_{ij}}{\pi_{ij}} \check{x}_i\check{x}_j(\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{N} \sum_{j=1}^{N} \check{x}_i\check{x}_j(\pi_{ij} - \pi_i \pi_j) \\ \blacksquare \]

  1. Show that the previous expression simplifies to equation 1.2

\[ \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{R_iR_jx_ix_j}{\pi_{ij}\pi_i\pi_j}(\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{x_ix_j}{\pi_{ij}\pi_i\pi_j}(\pi_{ij} - \pi_i \pi_j) \\ = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j(\frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}}) \\ = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{x_ix_j}{\pi_i \pi_j} - \frac{x_i x_j}{\pi_{ij}}\ \]

I’m not sure how the signs switch on the last line to reproduce expression 1.2

1.11 Another popular way to write the Horvitz-Thompson variance estimator is

\[ \hat{V}[\hat{T}_{HT}] = \sum_{i=1}^{n} x_i^{2} \frac{1-\pi_i}{\pi_i^2} + \sum_{i\neq j}x_ix_j\frac{\pi_{ij} - \pi_i \pi_j}{\pi_i\pi_j\pi_{ij}} \]

Show that this is equivalent to equation 1.2

We need to show that the above is equivalent to \[ \sum_{i,j} \frac{X_iX_j}{\pi_{ij}} - \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} \]

First we fix \(i \neq j\) in the above expression and we find \[ \sum_{i\neq j} \frac{X_iX_j}{\pi_{ij}} - \frac{X_i}{\pi_i}\frac{X_j}{\pi_j} = \sum_{i\neq j} X_iX_j(\frac{1}{\pi_{ij}} - \frac{1}{\pi_i} \frac{1}{\pi_j}) \\ = \sum_{i \neq j} X_iX_j(\frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}\pi_i\pi_j}) \]

which is the latter part in the desired expression save for a sign, which again I must be missing somehow or is an error in the book.

Now we take \(i=j\) and return to expression 1.2 in which we have,

\[ \sum_{i=j} \frac{X_i X_j}{\pi_{ij}} - \frac{X_i}{\pi_i} \frac{X_j}{\pi_j} \\ = \sum_{i=j} \frac{X_i^2}{\pi_{ii}} - \frac{X_i}{\pi_i} \frac{X_i}{\pi_i} \\ = \sum_{i=j} X_i^2 (\frac{1}{\pi_{ii}} - \frac{1}{\pi_i^2} ) \\ = \sum_{i=j} X_i^2 (\frac{\pi_i^2 - \pi_{ii}}{\pi_i^2 \pi_{ii}}) \] Clearly we have to formulate \(\pi_{ii}\) in terms of \(\pi_i\) but isn’t immediately clear to me how to do so. We know that for the two terms to be equal we must have

\[ \frac{\pi_i^2 - \pi_{ii}}{\pi_i^2 \pi_{ii}} = \frac{1 - \pi_i^2}{\pi_i^2} \\ \iff \\ \pi_{ii} = \frac{\pi_i^2}{2- \pi_i^2} \] Which I suppose we’ll take to be the expression of a co-inclusion probability of an entity sampled with itself (this must assume sampling with replacement) for this expression to be true.

Chapter 1 Appendix

The HTE is an unbiased estimator of the population total - I reproduce the expression from above, but now make explicit the indicator variables that express which observations are included in our sample, \(S\).

\[ HTE := \sum_{i=1}^{N} \frac{X_i I(X_i \in S)}{\pi_i} \\ E[HTE] = E\left [\sum_n \frac{X_i I(X_i \in S)}{\pi_i} \right ] \\ = \sum_n E \left [\frac{X_iI(X_i \in S)}{\pi_i} \right ] \\ = \sum_n \frac{X_iE[I(X_i \in S)]}{\pi_i} \\ = \sum_n \frac{X_i \pi_i}{\pi_i} = \sum_n X_i \]

Chapter 2: Simple and Stratified Sampling

Starting from Simple Random Samples

When dealing with a sample of size \(n\) from a population of size \(N\) the HTE of the total value of \(X_i\) in the population can be written as

\[ \begin{equation} HTE(X) = \hat{T_X} = \sum_{i=1}^{n} \frac{X_i}{\pi_i}. \end{equation} \]

For a simple random sample, the variance can be more explicitly written as

\[ \begin{equation} V[\hat{T_X}] = \frac{N-n}{N} \times N^{2} \times \frac{V[X]}{n}, \end{equation} \]

where \(\frac{N-n}{N}\) is the finite population correction factor. This factor is derived from the hypergeometric distribution and explains the reduction in uncertainty that follows from sampling a large portion of the population. Consequently, if the sample is taken with replacement — the same individual or unit has the possibility to be sampled twice — this term is no longer relevant. It should be noted that sampling with replacement is not usually used however, but sometimes this language is used to refer to the fact that the finite correction factor may not be used.

The second term, \(N^2\), rescales the estimate from the mean to the total, while the final term is simply the scaled variance of \(X\).

A point worth deliberating on, that Lumley notes as well, is that while the above equations suggest that a larger sample size is always better that is not always the case in reality. Non-response bias or the cost of surveys can dramatically diminish the quality of the dataset, even if the size is large. I state this is worth deliberating on because it is a matter of increasing importance in the world of “Big Data” - where it can be easy to delude oneself with confidence in their estimates because their sample is large, even when the sample is not well designed. See (Meng 2018) for a larger discussion of this topic.

It follows from the above that the HTE for the population size is defined as \(\hat{N} = \sum_{i=1}^{n} \frac{1}{\pi_i}\). This holds true in the case where, as here \(\pi_i = \frac{n}{N}\), a bit trivial, but also in those where \(\pi_i\) may be defined differently.

Confidence Intervals

The sampling distribution for the estimates — typically sample means and sums — across “repeated surveys” is Normal by the Central Limit Theorem, so the typical \(\bar{x} \pm 1.96 \sqrt{\frac{\sigma^2_X}{n}}\), expression is used to calculate a 95% confidence interval. Lumley offers the following example from the California Academic Performance Index (API) dataset to illustrate this idea.

data(api)
mn_enroll <- mean(apipop$enroll, na.rm = TRUE)
p1 <- apipop %>%
  ggplot(aes(x = enroll)) +
  geom_histogram() +
  xlab("Student Enrollment") +
  geom_vline(xintercept = mn_enroll, linetype = 2, color = "red") +
  ggtitle("Distribution of School Enrollment")
p2 <- replicate(n = 1000, {
  apipop %>%
    sample_n(200) %>%
    pull(enroll) %>%
    mean(., na.rm = TRUE)
})
mn_sample_mn <- mean(p2)
p2 <- tibble(sample_ix = 1:1000, sample_mean = p2) %>%
  ggplot(aes(x = sample_mean)) +
  geom_histogram() +
  xlab("Student Enrollment Averages") +
  geom_vline(
    xintercept = mn_sample_mn,
    linetype = 2, color = "red"
  ) +
  ggtitle("Distribution of Sample Means")
p1 + p2

Complex Sample Data in R

What follows is a work-up of basic survey estimates using the California API dataset composed of student standardized test scores. I’ll work through the code once using the survey package and a second time using the srvyr package, which has a tidyverse friendly API.

Much of the computational work in this book begins with creating a design object, from which weights and other information can then be drawn on for any number/type of estimates.

For example, we create a basic design object below, where we look at a classic simple random sample (SRS) of the schools in the API dataset. Let’s take a look at the dataset first.

dplyr::as_tibble(apisrs)
## # A tibble: 200 × 39
##    cds       stype name  sname  snum dname  dnum cname  cnum  flag pcttest api00
##    <chr>     <fct> <chr> <chr> <dbl> <chr> <int> <chr> <int> <int>   <int> <int>
##  1 15739081… H     "McF… McFa…  1039 McFa…   432 Kern     14    NA      98   462
##  2 19642126… E     "Sto… Stow…  1124 ABC …     1 Los …    18    NA     100   878
##  3 30664493… H     "Bre… Brea…  2868 Brea…    79 Oran…    29    NA      98   734
##  4 19644516… E     "Ala… Alam…  1273 Down…   187 Los …    18    NA      99   772
##  5 40688096… E     "Sun… Sunn…  4926 San …   640 San …    39    NA      99   739
##  6 19734456… E     "Los… Los …  2463 Haci…   284 Los …    18    NA      93   835
##  7 19647336… M     "Nor… Nort…  2031 Los …   401 Los …    18    NA      98   456
##  8 19647336… E     "Gla… Glas…  1736 Los …   401 Los …    18    NA      99   506
##  9 19648166… E     "Max… Maxs…  2142 Moun…   470 Los …    18    NA     100   543
## 10 38684786… E     "Tre… Trea…  4754 San …   632 San …    37    NA      90   649
## # ℹ 190 more rows
## # ℹ 27 more variables: api99 <int>, target <int>, growth <int>, sch.wide <fct>,
## #   comp.imp <fct>, both <fct>, awards <fct>, meals <int>, ell <int>,
## #   yr.rnd <fct>, mobility <int>, acs.k3 <int>, acs.46 <int>, acs.core <int>,
## #   pct.resp <int>, not.hsg <int>, hsg <int>, some.col <int>, col.grad <int>,
## #   grad.sch <int>, avg.ed <dbl>, full <int>, emer <int>, enroll <int>,
## #   api.stu <int>, pw <dbl>, fpc <dbl>

In the code below fpc stands for the aforementioned finite population correction factor and id=~1 designates the unit of analysis as each individual row in the dataset.

srs_design <- svydesign(id = ~1, fpc = ~fpc, data = apisrs)
srs_design
## Independent Sampling design
## svydesign(id = ~1, fpc = ~fpc, data = apisrs)

In order to calculate the mean enrollment based on this sample the, appropriately named, svymean function can be used.

svymean(~enroll, srs_design)
##          mean     SE
## enroll 584.61 27.368

This is the same as the typical computation - which makes sense, this is a SRS!

c(
  "Mean" = mean(apisrs$enroll),
  "SE" = sqrt(var(apisrs$enroll) / nrow(apisrs))
)
##      Mean        SE 
## 584.61000  27.82121

Instead of specifying the finite population correction factor, the sampling weights could be used - since this is a SRS, all the weights should be the same.

as_tibble(apisrs) %>% distinct(pw)
## # A tibble: 1 × 1
##      pw
##   <dbl>
## 1  31.0
nofpc <- svydesign(id = ~1, weights = ~pw, data = apisrs)
nofpc
## Independent Sampling design (with replacement)
## svydesign(id = ~1, weights = ~pw, data = apisrs)

Use svytotal to calculate the estimate of the total across all schools, note that the standard error will be different between the two designs because of the lack of fpc.

svytotal(~enroll, nofpc)
##          total     SE
## enroll 3621074 172325
svytotal(~enroll, srs_design)
##          total     SE
## enroll 3621074 169520

Totals across groups can be calculated using the ~ notation with a categorical variable.

svytotal(~stype, srs_design)
##          total     SE
## stypeE 4397.74 196.00
## stypeH  774.25 142.85
## stypeM 1022.01 160.33

svycontrast can be used to calculate the difference or addition of two different estimates - below we estimate the difference in the 2000 and 1999 scores based on the SRS design.

svycontrast(svymean(~ api00 + api99, srs_design), quote(api00 - api99))
##          nlcon     SE
## contrast  31.9 2.0905

Now again with the srvyr package

dstrata <- apisrs %>%
  as_survey_design(fpc = fpc)
dstrata %>%
  mutate(api_diff = api00 - api99) %>%
  summarise(
    enroll_total = survey_total(enroll, vartype = "ci"),
    api_diff = survey_mean(api_diff, vartype = "ci")
  ) %>%
  gt()
enroll_total enroll_total_low enroll_total_upp api_diff api_diff_low api_diff_upp
3621074 3286789 3955360 31.9 27.77764 36.02236

Stratified Sampling

Simple random samples are not often used in complex surveys because there is a justified concern that some strata (e.g. racial ethnic group, age group, etc.) may be underrepresented in the sample if a simple random sample were used. Similarly, complex designs can give the same precision at a lower cost. Consequently, a sample may be constructed so that some units are guaranteed to be included within a given strata - improving the resulting variance. When this is a simple random sample, the HTE and variance of the total population is simply the sum of the strata specific estimates; \(\hat{T}_{HT} = \sum_{s=1}^{S} \hat{T}^{s}_X\), where there are \(S\) strata within the population.

For example, in the apistrat data set a stratified random sample of 200 schools is recorded such that schools are sampled randomly within school type ( elementary, middle school or high school).

In the code below we can designate the strata using the categorical variable stype, which denotes each of the school type as strata.

strat_design <- svydesign(
  id = ~1,
  strata = ~stype,
  fpc = ~fpc,
  data = apistrat
)
strat_design
## Stratified Independent Sampling design
## svydesign(id = ~1, strata = ~stype, fpc = ~fpc, data = apistrat)
svytotal(~enroll, strat_design)
##          total     SE
## enroll 3687178 114642
svymean(~enroll, strat_design)
##          mean     SE
## enroll 595.28 18.509
svytotal(~stype, strat_design)
##        total SE
## stypeE  4421  0
## stypeH   755  0
## stypeM  1018  0

Note that are standard errors are 0 for the within strata estimates because if we have strata information on each member of the population, then we know the strata counts without any uncertainty.

Several points worth noting about stratified samples before moving on.

  • Stratified samples get their power from “conditioning” on the strata information that explain some of the variability in the measure.

  • Whereas a SRS might have a chance of leaving out an elementary or middle school, and leaving a higher estimate of enrollment, because of a higher number of highschools in the sample, keeping a fixed number of samples within each strata removes this problem.

  • Stratified analysis may also refer to something entirely different from what we’re discussing here — where a subgroup has some model or estimate fit only on that subgroup’s data exclusively.

Now again with the srvyr package

srvyr_strat_design <- apistrat %>%
  as_survey_design(
    strata = stype,
    fpc = fpc
  )
srvyr_strat_design
## Stratified Independent Sampling design
## Called via srvyr
## Sampling variables:
##  - ids: `1`
##  - strata: stype
##  - fpc: fpc
## Data variables: cds (chr), stype (fct), name (chr), sname (chr), snum (dbl),
##   dname (chr), dnum (int), cname (chr), cnum (int), flag (int), pcttest (int),
##   api00 (int), api99 (int), target (int), growth (int), sch.wide (fct),
##   comp.imp (fct), both (fct), awards (fct), meals (int), ell (int), yr.rnd
##   (fct), mobility (int), acs.k3 (int), acs.46 (int), acs.core (int), pct.resp
##   (int), not.hsg (int), hsg (int), some.col (int), col.grad (int), grad.sch
##   (int), avg.ed (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
##   (dbl), fpc (dbl)
srvyr_strat_design %>%
  summarise(
    enroll_total = survey_total(enroll),
    enroll_mean = survey_mean(enroll)
  ) %>%
  gt()
enroll_total enroll_total_se enroll_mean enroll_mean_se
3687178 114641.7 595.2821 18.50851
srvyr_strat_design %>%
  group_by(stype) %>%
  survey_count()
## # A tibble: 3 × 3
## # Groups:   stype [3]
##   stype     n     n_se
##   <fct> <dbl>    <dbl>
## 1 E     4421. 2.82e-13
## 2 H      755. 9.81e-14
## 3 M     1018. 7.42e-14

Replicate Weights

Replicate weights exploit sub-sampling to derive more generalizable statistics than sampling weights. This is particularly useful when estimating a “nonparametric” statistic like the median or a quantile which doesn’t have an easily derived variance.

For a basic idea of why this works, Lumley notes that one could estimate the variance of a total by using two independent half samples estimating the same total, i.e. if \(\hat{T}_A\) and \(\hat{T}_B\) are both from two independent half samples estimating \(\hat{T}\) then the variance of the difference of the two half samples is proportional to the variance of the original total:

\[ E\left[ (\hat{T}_A - \hat{T}_B)^2 \right] = 2 V[\hat{T}_A] = 4 V[\hat{T}]. \]

There are multiple ways one might set up these splits that are more efficient than the straightforward “half” sample described above - Lumley discusses 3 variants briefly:

  1. Balanced Repeated Replication (BRR) Based on the work of (McCarthy 1966).
  • (Judkins 1990), extends BRR to handle issues with sparse signals and small samples.
  1. Jackknife
  • Because BRR and Fay’s method is difficult with other designs using overlapping subsamples, Jackknife and the bootstrap are intended to be more flexible.
  1. Bootstrap
  • This is the method I’m most familiar with, outside of complex designs.
  • Lumley states that using the Bootstrap in this setting involves taking a sample (with replacement) of observations or clusters and multiplying the sampling weight by the number of times the observation appears in the sample.

Each of these ideas relies on the fundamental idea that we can calculate the variance of our statistic of interest by using — sometimes carefully chosen — subsamples of our original sample to calculate our statistic of interest and more importantly, the variance of that statistic. Lumley’s use of the equation above gives the basic idea but I believe the more rigorous justification appeals to theory involving empiricial distribution functions, as much of the theory underlying these ideas relies on getting a good estimate of the empirical distribution.

It isn’t explicitly clear which of these techniques is most popular currently, but my guess would be that the bootstrap is the most used. This also happens to be the method that Lumley has provided the most citations for in the text. I’ve also run into cases where the US Census IPUMS data uses successive difference weights.

All this to say that replicate weights are powerful for producing “non-parametric” estimates, like quantiles and so on, and different weighting techniques may be more or less appropriate depending on the design and data involved.

Replicate Weights in R

Lumley first demonstrates how to setup a survey design object when the weights are already provided. I’ve had trouble accessing the 2005 California Health Interview Survey data on my own but he thankfully provides a link to the data on his website.

chis_adult <- as.data.frame(read_dta("Data/ADULT.dta")) %>%
  # have to convert labeled numerics to regular numerics for
  # computation in survey package.
  mutate(
    bmi_p = as.numeric(bmi_p),
    srsex = factor(srsex, labels = c("MALE", "FEMALE")),
    racehpr = factor(racehpr, labels = c(
      "LATINO", "PACIFIC ISLANDER",
      "AMERICAN INDIAN/ALASKAN NATIVE",
      "ASIAN", "AFRICAN AMERICAN",
      "WHITE",
      "OTHER SINGLE/MULTIPLE RACE"
    ))
  )
chis <- svrepdesign(
  variables = chis_adult[, 1:418],
  repweights = chis_adult[, 420:499],
  weights = chis_adult[, 419, drop = TRUE],
  ## combined.weights specifies that the replicate weights
  ## include the sampling weights
  combined.weights = TRUE,
  type = "other", scale = 1, rscales = 1
)
chis
## Call: svrepdesign.default(variables = chis_adult[, 1:418], repweights = chis_adult[, 
##     420:499], weights = chis_adult[, 419, drop = TRUE], combined.weights = TRUE, 
##     type = "other", scale = 1, rscales = 1)
## with 80 replicates.

When creating replicate weights in R one specifies a replicate type to the type argument.

boot_design <- as.svrepdesign(strat_design,
  type = "bootstrap",
  replicates = 100
)
boot_design
## Call: as.svrepdesign.default(strat_design, type = "bootstrap", replicates = 100)
## Survey bootstrap with 100 replicates.

By default, the as.svrepdesign() function assumes the replicate weights are jackknife replicates.

## jackknife is the default
jk_design <- as.svrepdesign(strat_design)
jk_design
## Call: as.svrepdesign.default(strat_design)
## Stratified cluster jackknife (JKn) with 200 replicates.

Once the design object is created the mean of a variable can computed equivalently as before using the svymean() function. We’ll compare the bootstrap and jackknife estimates, noting that the bootstrap has a higher standard error than the jackknife.

svymean(~enroll, boot_design)
##          mean     SE
## enroll 595.28 18.437
svymean(~enroll, jk_design)
##          mean     SE
## enroll 595.28 18.509

Of course, part of the motivation in using replicate weights is that you’re able to estimate standard errors for non-trivial estimands, especially those that may not be implemented in the survey package. Lumley demonstrates this using a sample from the National Wilms Tumor Study Cohort, in order to estimate the five year survival probability via a Kaplan-Meier Estimator.

library(addhazard)
nwtsco <- as_tibble(nwtsco)
head(nwtsco)
## # A tibble: 6 × 12
##    trel  tsur relaps  dead study stage histol instit   age    yr specwgt tumdiam
##   <dbl> <dbl>  <int> <int> <int> <int>  <int>  <int> <dbl> <int>   <int>   <int>
## 1 21.9  21.9       0     0     3     1      1      1  2.08  1979     750      14
## 2 11.3  11.3       0     0     3     2      0      0  4.17  1979     590       9
## 3 22.1  22.1       0     0     3     1      1      1  0.75  1979     356      13
## 4  8.02  8.02      0     0     3     2      0      0  2.67  1979     325       9
## 5 20.5  20.5       0     0     3     2      0      0  3.67  1979     970      17
## 6 14.4  14.4       1     1     3     2      0      1  2.58  1979     730      15
cases <- nwtsco %>% filter(relaps == 1)
cases <- cases %>% mutate(wt = 1)
ctrls <- nwtsco %>% filter(relaps == 0)
ctrls <- ctrls %>%
  mutate(wt = 10) %>%
  sample_n(325)
ntw_sample <- rbind(cases, ctrls)

fivesurv <- function(time, status, w) {
  scurve <- survfit(Surv(time, status) ~ 1, weights = w)
  ## minimum probability that corresponds to a survival time > 5 years
  return(scurve$surv[min(which(scurve$time > 5))])
}

des <- svydesign(id = ~1, strata = ~relaps, weights = ~wt, data = ntw_sample)
jkdes <- as.svrepdesign(des)
withReplicates(jkdes, quote(fivesurv(trel, relaps, .weights)))
##        theta     SE
## [1,] 0.83589 0.0017

The estimated five year survival probability of 84% (95% CI: 84%,85%) uses the fivesurv function which computes the kaplan meier estimate of five year survival probability fora given time status and weight. The withReplicates function then re-estimates this statistic using each set of replicates and calculates the standard error from the variability of these estimates.

Its worth noting that this is the standard error for estimating the five year survival in the NWTS cohort, not the hypothetical superpopulation of all children with Wilms’ tumor.

Now again with the srvyr package

boot_design <- as_survey_rep(strat_design,
  type = "bootstrap",
  replicates = 100
)
boot_design
## Call: Called via srvyr
## Survey bootstrap with 100 replicates.
## Data variables: cds (chr), stype (fct), name (chr), sname (chr), snum (dbl),
##   dname (chr), dnum (int), cname (chr), cnum (int), flag (int), pcttest (int),
##   api00 (int), api99 (int), target (int), growth (int), sch.wide (fct),
##   comp.imp (fct), both (fct), awards (fct), meals (int), ell (int), yr.rnd
##   (fct), mobility (int), acs.k3 (int), acs.46 (int), acs.core (int), pct.resp
##   (int), not.hsg (int), hsg (int), some.col (int), col.grad (int), grad.sch
##   (int), avg.ed (dbl), full (int), emer (int), enroll (int), api.stu (int), pw
##   (dbl), fpc (dbl)
boot_design %>% summarise(Mean = survey_mean(enroll))
## # A tibble: 1 × 2
##    Mean Mean_se
##   <dbl>   <dbl>
## 1  595.    18.3

It’s not clear or straightforward to me from reading the srvyr docs how to estimate the weighted survival function probability — I may return to this later.

Final Notes on Replicate Weights

Lumley finishes this section by noting that the bootstrap typically works better when all the strata are large. While a strata correction is available it is likely not correct for small or unequal strata.

Separately, Lumley note that both the jackknife and bootstrap can incorporate finite population correction factors.

Finally, the BRR designs implemented in the survey package will use at most excess 4 replicate splits for \(K < 180\) and at most 5% when \(K > 100\). It is not clear to me from the reading, which is more likely to be used for \(100 < K < 180\).

Other population summaries

While population means, totals, and differences are typically easy to estimate via the horvitz thompson estimator there are other population statistics such as the median or regression estimates that are more complex. These require using the replicate weights described in the previous section or making certain linearization / interpolation assumptions which may or may not influence the resulting estimate.

Quantiles

Estimation of quantiles involves estimating arbitary points along the cumulative distribution function(cdf). For example, the 90th percentile has 90% of the estimated population size below it and 10% above. In this case, for cdf \(F_X(x)\), we want to estimate \(x: F_X(x) = 0.9\). However, estimating the cdf presents some technical difficulties in that the empirical cumulative distribution function (ecdf), is not typically a “smooth” estimate for any given \(x\) — as the estimate is highly dependent upon the sample. Consequently, Lumley’s function, svyquantile() interpolates linearly between two adjacent observations when the quantile is not uniquely defined.

samp <- rnorm(20)
plot(ecdf(samp))
Empirical Cumulative Distribution Function - note the jumps at distinctive points along the x-axis.

Empirical Cumulative Distribution Function - note the jumps at distinctive points along the x-axis.

Confidence intervals are constructed similarly, using the ecdf, though it should be noted that estimating extreme quantiles poses difficulties because of the low density values in the area.

A first calculation to demonstrate this using replicate weights with the CA health interview study, estimating different quantiles of BMI.

svyquantile(~bmi_p, design = chis, quantiles = c(0.25, 0.5, 0.75))
## $bmi_p
##      quantile ci.2.5 ci.97.5         se
## 0.25    22.68  22.66   22.81 0.03767982
## 0.5     25.75  25.69   25.80 0.02763161
## 0.75    29.18  29.12   29.29 0.04270393
## 
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"

The same thing can be done with the stratified design. Here the uncertainty is computed via the estimates of the ecdf and finding the pointwise confidence interval for different points along the curve.

svyquantile(~api00,
  design = strat_design, quantiles = c(0.25, 0.5, 0.75),
  ci = TRUE
)
## $api00
##      quantile ci.2.5 ci.97.5       se
## 0.25      565    535     597 15.71945
## 0.5       668    642     694 13.18406
## 0.75      756    726     778 13.18406
## 
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"

You can see how to construct the same estimate below using the srvyr package.

srvyr_strat_design %>%
  summarize(quantiles = survey_quantile(api00, quantiles = c(0.25, 0.5, 0.75)))
## # A tibble: 1 × 6
##   quantiles_q25 quantiles_q50 quantiles_q75 quantiles_q25_se quantiles_q50_se
##           <dbl>         <dbl>         <dbl>            <dbl>            <dbl>
## 1           565           668           756             15.7             13.2
## # ℹ 1 more variable: quantiles_q75_se <dbl>

Contingency Tables

Lumley’s main points in this section focus on the complications in interpretation of typical contingency table tests of association in a design based setting. Specifically, he points out that it is not obvious how the null distribution should be generated without making some kind of modeling assumptions. Quoting from the book (text in parentheses from me):

For example, if there are 56,181,887 women and 62,710,561 men in a population it is not possible for the proportions of men and women who are unemployed to be the same, since these population sizes have no common factors. We would know without collecting any employment data that the finite-population null hypothesis (of equal proportions) was false. A more interesting question is whether the finite population could have arisen from a process that had no association between the variables: is the difference at the population level small enough to have arisen by chance…. A simpler approach is to treat the sample as if it came from an infinite superpopulation and simply ignore the finite-population corrections in inference.

The super-population approach offers the more interesting approach philosophically and thus is implemented in the survey package. The svychisq function implements a test for no association as the null using a chi-squared distribution with a correction for the mean and variance. Lumley discusses various methods for computing the \(\chi^2\) statistic in this setting and their implementations in svycontrast(). I’d suggest looking at the function documentation if that level of detail is needed.

Lumley demonstrates how to call these functions estimating the proportion of smokers in each insurance status group from the California Health Interview Survey.

tab <- svymean(~ interaction(ins, smoking, drop = TRUE), chis)
tab
##                                               mean     SE
## interaction(ins, smoking, drop = TRUE)1.1 0.112159 0.0021
## interaction(ins, smoking, drop = TRUE)2.1 0.039402 0.0015
## interaction(ins, smoking, drop = TRUE)1.2 0.218909 0.0026
## interaction(ins, smoking, drop = TRUE)2.2 0.026470 0.0012
## interaction(ins, smoking, drop = TRUE)1.3 0.507728 0.0036
## interaction(ins, smoking, drop = TRUE)2.3 0.095332 0.0022
ftab <- ftable(tab, rownames = list(
  ins = c("Insured", "Uninsured"),
  smoking = c("Current", "Former", "Never")
))
round(ftab * 100, 1)
##              ins Insured Uninsured
## smoking                           
## Current mean        11.2       3.9
##         SE           0.2       0.1
## Former  mean        21.9       2.6
##         SE           0.3       0.1
## Never   mean        50.8       9.5
##         SE           0.4       0.2

In the output below we see a very small p-value indicating that the data were unlikely to be generated from a process in which smoking and insurance status were independent.

svychisq(~ smoking + ins, chis)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~smoking + ins, chis)
## F = 130.11, ndf = 1.9923, ddf = 157.3884, p-value < 2.2e-16

Estimates in Subpopulations

Estimation within subpopulations (also called domain estimation) that are sampled strata is easy since a stratified sample is composed of random samples within strata by definition; simply compute the desired statistic within the given strata using the strata-specific random sample.

When the subpopulation of interest is not a strata, things are more difficult. While the sampling weights would still be correct for representing any given observation to the total population — resulting in an unbiased mean point estimate — the co-occurrence probabilities \(\pi_{i,j}\) would be incorrect, because the co-occurrence probabilities are now unknown/random and not fixed by design. The two (usual) approaches for trying to estimate subpopulation variance estimates in-spite of these difficulties are linearization and replicate weighting. The computation is straightforward for replicate weighting since the non-subpopulation entities can simply be discarded in the computation. For linearization the computation is less straightforward as the extra entities still have to be included as “0”s in the final computation – this is according to Lumley’s arguments in his Appendix.

Examples below demonstrating this idea estimate the number of teachers with emergency, emer, training amongst California schools using the api dataset.

emerg_high <- subset(strat_design, emer > 20)
emerg_low <- subset(strat_design, emer == 0)
svymean(~ api00 + api99, emerg_high)
##         mean     SE
## api00 558.52 21.708
## api99 523.99 21.584
svymean(~ api00 + api00, emerg_low)
##         mean     SE
## api00 749.09 17.516
svytotal(~enroll, emerg_high)
##         total     SE
## enroll 762132 128674
svytotal(~enroll, emerg_low)
##         total    SE
## enroll 461690 75813

In general, if replicate weights are available, domain estimation is much easier.

bys <- svyby(~bmi_p, ~ srsex + racehpr, svymean,
  design = chis,
  keep.names = FALSE
)
print(bys, digits = 3)
##     srsex                        racehpr bmi_p     se
## 1    MALE                         LATINO  28.2 0.1447
## 2  FEMALE                         LATINO  27.5 0.1443
## 3    MALE               PACIFIC ISLANDER  29.7 0.7055
## 4  FEMALE               PACIFIC ISLANDER  27.8 0.9746
## 5    MALE AMERICAN INDIAN/ALASKAN NATIVE  28.8 0.5461
## 6  FEMALE AMERICAN INDIAN/ALASKAN NATIVE  27.0 0.4212
## 7    MALE                          ASIAN  24.9 0.1406
## 8  FEMALE                          ASIAN  23.0 0.1112
## 9    MALE               AFRICAN AMERICAN  28.0 0.2663
## 10 FEMALE               AFRICAN AMERICAN  28.4 0.2417
## 11   MALE                          WHITE  27.0 0.0598
## 12 FEMALE                          WHITE  25.6 0.0680
## 13   MALE     OTHER SINGLE/MULTIPLE RACE  26.9 0.3742
## 14 FEMALE     OTHER SINGLE/MULTIPLE RACE  26.7 0.3158
# This is the code from the book but it didn't work for me
# because of issues in the survey R package, I reproduce the
# first result using the srvyr package below
# medians <- svyby(~bmi_p, ~ srsex + racehpr, svyquantile,
#   design = chis,
#   covmat = TRUE,
#   quantiles = 0.5
# )
# svycontrast(medians, quote(MALE.LATINO/FEMALE.LATINO))

medians <- chis %>%
  as_survey() %>%
  group_by(srsex, racehpr) %>%
  summarize(Median_BMI = survey_median(bmi_p, vartype = "ci"))

medians
## # A tibble: 14 × 5
## # Groups:   srsex [2]
##    srsex  racehpr                       Median_BMI Median_BMI_low Median_BMI_upp
##    <fct>  <fct>                              <dbl>          <dbl>          <dbl>
##  1 MALE   LATINO                              27.4           27.3           27.8
##  2 MALE   PACIFIC ISLANDER                    28.9           27.8           30.7
##  3 MALE   AMERICAN INDIAN/ALASKAN NATI…       28.4           26.9           29.3
##  4 MALE   ASIAN                               24.4           24.2           24.8
##  5 MALE   AFRICAN AMERICAN                    27.4           26.6           28.1
##  6 MALE   WHITE                               26.4           26.3           26.5
##  7 MALE   OTHER SINGLE/MULTIPLE RACE          26.6           25.9           27.5
##  8 FEMALE LATINO                              26.3           25.8           26.5
##  9 FEMALE PACIFIC ISLANDER                    27.3           25.6           28.3
## 10 FEMALE AMERICAN INDIAN/ALASKAN NATI…       25.1           24.5           26.5
## 11 FEMALE ASIAN                               22.1           22.0           22.4
## 12 FEMALE AFRICAN AMERICAN                    27.2           26.6           27.5
## 13 FEMALE WHITE                               24.3           24.2           24.4
## 14 FEMALE OTHER SINGLE/MULTIPLE RACE          25.7           25.1           26.5

Design of Stratified Samples

How to pick the sample size for each strata? Well it depends on the goals of the analysis. If the goal is to estimate a total across the whole population, the formula for the variance of a total can be used to gain insights about optimal allocation. Since the variance of the total is dependent (via sum) of the strata specific variances, more sample size would want to be dedicated to more heterogeneous and/or larger strata.

This general approach means that the sample size for strata \(k\), \(n_k\) should be proportional to the population strata size \(N_k\) and strata variance \(\sigma^{2}_k\), \(n_k \propto \sqrt{N^2_k \sigma^2_k} = N_k \sigma_k\). Lumley notes that while this expression satisfies some theoretical optimality criteria, it is often the case that different strata have different costs associated with their sampling and so the expression can be modified in order to take into account this cost as follows:

\[ n_k \propto \frac{ N_k \sigma_k}{\sqrt{\text{cost}_k}}, \]

where cost\(_k\) is the cost of sampling for strata \(k\).

Questions

1.You are conducting a survey of emergency preparedness at a large HMO, where you want to estimate what proportion of medical staff would be able to get to work after an earthquake.

a.) You can either send out a single questionnaire to all staff, or send out a questionnaire to about 10% of the staff and make follow-up phone calls for those that don’t respond. What are the disadvantages of each approach?

This comes down to a discussion of cost for sampling and what missing data mechanism may be at play. As a simple starting point, if we were to assume the resulting data were MCAR and the non response rate was equivalent between both sampling strategies, the single questionnaire would be preferred because it would result in a higher overall sample size. These assumptions are probably not likely however, and we may expect that non-response is associated with other meaningful factors, by choosing a the follow-up phone call we might minimize non-response to both reduce bias and improve precision.

Additional relevant concerns would be the possible response or lack of response of certain strata — certain physicians, technicians or other kinds of staff’s response would likely be worth knowing yet these groups may be less well represented in a 10% simple random sample of the population.

b.) You choose to survey just a sample. What would be useful variables to stratify the sampling, and why?

The aforementioned job title would be useful to stratify on. This would likely be most useful to conduct within each department. Further, if the HMO has more than one site or clinic, that would be worth stratifying on as well for substantive reasons just as much as statistical reasons.

c.) The survey was conducted with just two strata: physicians and other staff. The HMO has 900 physicians and 9000 other staff. You sample 450 physicians and 450 other staff. What are the sampling probabilities in each stratum?

physician strata sampling probabilities are \(\frac{n}{N_k} = \frac{450}{900} = \frac{1}{2}\), while the “other staff” probabilities are \(\frac{450}{9000} = \frac{1}{20}\)

d.) 300 physicians and 150 other staff say they would be able to get to work after an earthquake. Give unbiased estimates of the proportion in each stratum and the total proportion.

The physician strata estimate would be \(\frac{300}{450} = \frac{2}{3}\). The staff strata would be \(\frac{150}{450} = \frac{1}{3}\) The total proportion would be \(\frac{2 \times 300 + 20 \times 150}{9900}\). This value can be recreated below with the survey package as follows.

df <- tibble(
  id = 1:900,
  job = c(rep("MD", 450), rep("staff", 450)),
  prep = c(rep(1, 300), rep(0, 150), rep(1, 150), rep(0, 300)),
  weights = c(rep(2, 450), rep(20, 450))
)
hmo_design <- svydesign(strata = ~job, ids = ~0, weights = ~weights, data = df)
hmo_design
## Stratified Independent Sampling design (with replacement)
## svydesign(strata = ~job, ids = ~0, weights = ~weights, data = df)
svymean(~prep, hmo_design)
##         mean     SE
## prep 0.36364 0.0203

e.) How would you explain to the managers that commissioned the study how the estimate was computed and why it wasn’t just the number who said “yes” divided by the total number surveyed?

We sampled from the total population using the strata because we though these two groups would respond differently and indeed, they did. Physicians have are twice as likely to be able to make it to the hospital in the event of an emergency as general staff. However, physicians make up a much smaller proportion of the overall hospital workforce and so we need to down weight their responses, relative to general staff in order to ensure their response reflects their distribution in the total population, thus the total estimate of the HMO’s emergency preparedness is much closer to the “general” staff’s strata estimate of \(\frac{1}{3}\).

2.You are conducting a survey of commuting time and means of transport for a large university. What variables are likely to be available and useful for stratifying the sampling?

Probably worth stratifying on “role” at university — student vs. staff vs. professor. Each of these have varying amounts of income available and would likely determine their different means and, consequently, commute time of getting to campus. It might also be worth stratifying on the department of employment for staff and professors, as there can be a wide variability in these measures, again, by department.

3.-4. Skip because of CHIS data issues

  1. In the Academic Performance Index data we saw large gains in precision from stratification on school type when estimating mean or total school size, and no gain when estimating mean Academic performance Index. Would you expect a large or small gain from the following variables: mobility, emer, meals, pcttest? Compare your expectations with the actual results.

The general principle here, is which of these variables do we expect to have some association with the school type. The greater association the more the benefit from stratifying.

  1. For estimating total school enrollment in the Academic Performance Index population, what is the optimal allocation of a total sample size of 200 stratified by school size? Draw a sample with this optimal allocation and compare the standard errors to the stratified sample in Figure 2.5 for: total enrollment, mean 2000 API, mean meals, mean ell.

A first point worth noting is that school size is an integer valued variable and so some grouping will have to be created to define the strata from which schools are then drawn. One possible option is to define the strata as above and below the median school enrollment. Since this divides the population exactly in half the strata are equally sized and the only differentiating factor is the variability of the enrollment.

as_tibble(apipop) %>%
  transmute(
    enroll = enroll,
    strat = if_else(enroll > median(enroll, na.rm = TRUE), "Above", "Below")
  ) %>%
  group_by(strat) %>%
  summarize(sd_enroll = sd(enroll))
## # A tibble: 3 × 2
##   strat sd_enroll
##   <chr>     <dbl>
## 1 Above     504. 
## 2 Below      89.8
## 3 <NA>       NA

Since the variability of the school enrollment sizes in the above median size schools is roughly five times that in the below median size schools, we’d sample from the two strata at a 5:1 ratio, respectively. Or, explicitly, we’d sample 160 schools from the above median school size and 40 schools from the below median school size.

api_opt_strat <- as_tibble(apipop) %>%
  filter(!is.na(enroll)) %>%
  transmute(
    enroll = enroll,
    strat = if_else(enroll > median(enroll, na.rm = TRUE), "Above", "Below")
  )

above <- api_opt_strat %>%
  filter(strat == "Above") %>%
  mutate(fpc = n()) %>%
  slice_sample(n = 160)

below <- api_opt_strat %>%
  filter(strat == "Below") %>%
  mutate(fpc = n()) %>%
  slice_sample(n = 40)

opt_strat_design <- svydesign(
  ids = ~1, strata = ~strat, fpc = ~fpc,
  data = rbind(above, below),
)

svytotal(~enroll, opt_strat_design)
##          total     SE
## enroll 3847981 142095

The mean estimate is slightly closer to the true value, 3811472 but the standard error is slightly larger (119K vs 114K) compared to the previous stratified design. However, the other design sampled 1000s of schools per strata so this is remarkably efficient.

I won’t perform the other comparisons, but one would expect this design to be much less efficient at estimating the other variables if they are not well correlated with enrollment size.

  1. Figure 2.1 shows that the mean school size (enroll) in simple random samples of size 200 from the Academic Performance Index population has close to a Normal distribution.
  1. Construct similar graphs for SRS of size 200, 50, 25, 10.

I’ve already done something like this at the start of the chapter. We’d expect the central limit theorem (the normality of the sample means) to be better for the larger sample sizes listed. Or rather, we might not trust the standard error for the sample size of 10 to really represent the variability of the sample mean.

  1. Repeat for median school size.
pop_median <- median(apipop$enroll)
p1 <- apipop %>%
  ggplot(aes(x = enroll)) +
  geom_histogram() +
  xlab("Student Enrollment") +
  geom_vline(xintercept = pop_median, linetype = 2, color = "red") +
  ggtitle("Distribution of School Enrollment",
    subtitle = "Median Enrollment Identified"
  )

p <- replicate(n = 500, {
  apipop %>%
    sample_n(200) %>%
    pull(enroll) %>%
    median(., na.rm = TRUE)
})

p <- tibble(sample_ix = 1:500, sample_median = p) %>%
  ggplot(aes(x = sample_median)) +
  geom_histogram() +
  xlab("Student Enrollment Medians") +
  geom_vline(
    xintercept = pop_median,
    linetype = 2, color = "red"
  ) +
  ggtitle("Distribution of Sample Medians")
p1 + p

  1. Repeat for mean school size in stratified samples of size 100, 52, 24, 12 using the same stratification proportions (50% elementary, 25% middle schools, 25% high schools) as in the built-in stratified sample.

The same result holds since the samples are independent within and across strata as well.

ns <- c(100, 52, 24, 12)
simdf <- map_dfr(1:200, function(sim_ix) {
  map_dfr(ns, function(n) {
    apipop %>%
      filter(stype == "E") %>%
      mutate(fpc = n()) %>%
      slice_sample(n = n * .5) %>%
      rbind(
        .,
        apipop %>%
          filter(stype != "E") %>%
          group_by(stype) %>%
          mutate(fpc = n()) %>%
          slice_sample(n = n * .25)
      ) %>%
      as_survey_design(strata = stype, fpc = fpc) %>%
      summarize(mean_enroll = survey_mean(enroll, na.rm = TRUE)) %>%
      mutate(sim_ix = sim_ix, n = n)
  })
})
simdf %>%
  ggplot(aes(x = mean_enroll, fill = factor(n))) +
  geom_histogram() +
  theme(legend.title = element_blank()) +
  ggtitle("CLT for stratified samples of different sizes.")

  1. In a design with just two strata write the sample sizes as \(n_1\) and \(n-n_1\) so that there is only one quantity that can be varied. Differentiate the variance of the total with respect to \(n_1\) to find the optimal allocation for two strata. Extend this to any number of strata by using the fact that an optimal allocation cannot be improved by moving samples from one stratum to another stratum.

\[ \hat{V}[\hat{T}] = V_1 + V_2 \\ V_1 = \frac{N_1 - n_1}{N_1} N_1^2 \frac{\sigma^2_1}{n_1} \\ V_2 = \frac{N_2 - (n - n_1)}{N_2} N_2^2 \frac{\sigma^2_2}{n_2} \] taking the derivative …

\[ \frac{d\hat{V}[\hat{T}]}{dn_1} = \frac{dV_1}{dn_1} + \frac{dV_2}{dn_1} \\ = \frac{-N_1^2 \sigma^2_1}{n_1^2} + \frac{N_2^2 \sigma_2^2}{(n - n_1)^2} \] setting to zero and and solving for \(n_1\) we get \[ \frac{-N_1^2 \sigma^2_1}{n_1^2} + \frac{N_2^2 \sigma_2^2}{(n - n_1)^2} = 0 \\ \iff \\ \frac{N_2^2\sigma^2_2}{(n-n_1)^2} = \frac{N_1^2\sigma_1^2}{n_1^2}\\ \iff \\ \frac{n_1^2}{(n-n_1)^2} = \frac{N_1^2\sigma_1^2}{N_2^2\sigma^2_2} \\ \iff \\ \frac{n_1}{(n-n_1)} = \frac{N_1\sigma_1}{N_2\sigma} \\ \iff \\ n_1 = \frac{nN_1\sigma_1}{N_2\sigma_2(1 + \frac{N_1\sigma_1}{N_2\sigma_2})} \\ = \frac{nN_1\sigma_1}{N_2\sigma_2 + N_1\sigma_1} \]

Where the square root is taken over variables constrained to be positive so we only have the positive values as the solution to the equation.

Also, \(N_1, N_2\) are taken to be the population strata sizes and \(\sigma_1, \sigma_2\) are the strata standard deviations.

We can check the second derivative to ensure this is a global optima or we can use the fact that the variance is a quadratic function and is therefore convex. Consequently, there is only one global optima.

Examining the expression - we see the optimal strata size is the variance of each strata weighted by the size of the strata as a fraction of the same value added across all strata, i.e. the population variance.

Here we’ve derived the solution for the “first” strata, but this is arbitrary and there would be a similar value for any strata, for a design with any number of strata.

  1. Write an R function that takes inputs \(n_1, n_2, N_1, N_2, \sigma^2_1, \sigma^2_2\) and computes the variance of the population total in a stratified sample. Choose some reasonable values of the population sizes and variances, and graph this function as \(n_1\) and \(n_2\) change, to find the optimum and to examine how sensitive the variance is the precise values of \(n_1\) and \(n_2\).
strat_var_sample <- function(n_1, n_2, N_1, N_2, sigma_1, sigma_2) {
  one_strat_var <- function(n, N, sigma) {
    (N - n) / N * N^2 * sigma^2 / n
  }
  return(one_strat_var(n_1, N_1, sigma_1) + one_strat_var(n_2, N_2, sigma_2))
}
expand.grid(
  n_1 = seq(2, 100, 10), n_2 = seq(2, 100, 10),
  N_1 = 1E3, N_2 = 1E3, sigma_1 = 1, sigma_2 = 1
) %>%
  as_tibble() %>%
  mutate(
    var = map2_dbl(n_1, n_2, strat_var_sample, unique(N_1), unique(N_2), unique(sigma_1), unique(sigma_2)),
    n_2 = factor(n_2)
  ) %>%
  ggplot(aes(x = n_1, y = var, color = n_2)) +
  geom_point() +
  geom_line() +
  ylab("Variance") +
  ggtitle("Two Strata Design Variance", subtitle = "For Varying Strata Sizes")

I think the point Lumley is after is that the variance doesn’t change dramatically after each sample size is roughly more than ten. This is the quadratic behavior of the variance at play. Obviously the “optimal” variance will be found at the highest \(n_1\) and \(n_2\) as those values will provide the lowest variance. In terms of where the greatest efficiency lies, however, we see the greatest change in the variance estimates after the two sample sizes are greater than 10, as stated previously.

  1. Verify that equation 2.2 gives the HTE of variance for a SRS
  1. Show that when \(i \neq j\) \[ \pi_{ij} = \frac{n}{N}n - 1N - 1 \] This follows from combinatorics. We want the number of combinations in which two (different) items are chosen from a sample of n together, divided by the total number of combinations of n samples from a population of size N. Though, it isn’t clear to me whether the overall sampling design should be with or without replacement - I assume without below.

\[ \frac{n \choose 2}{N\choose n} = \frac{n!}{(n-2)!2!} \frac{(N-n)!n!}{N!} \\ =\frac{n(n-1)n!}{2N(N-1)} \] I don’t see yet how this reduces to the intended form but may return to this later.

  1. Compute \(\pi_{ij} - \pi_i \pi_j\)

  2. Show that the equation in exercise 1.10 (c) reduces to equation 2.2

  3. Suppose instead each individual in the population is independently sampled with probability \(\frac{n}{N}\), so that the sample size \(n\) is not fixed. Show that the finite population correction disappears from equation 2.2 for this Bernoulli sampling design.

Chapter 3: Cluster Sampling

Why Clusters? The NHANES design

Why sample clusters? Because sometimes it’s easier than sampling individuals. Specifically, in cases where the cost of sampling individuals can be quite high, sampling clusters can be more efficient. This is in spite of the fact that within cluster correlation tends to be positive, reducing the information in the sample. Lumley uses the NHANES survey to motivate this idea: moving mobile examination centers all across the country to sample individuals is extremely expensive. By sampling a large number of individuals within a census tract aggregation area the NHANES survey is able to reduce the cost of their effort at a reasonable expense in precision.

Single-stage and multistage designs

Depending on the type of clusters involved it can be easy to sample the entire cluster as classrooms, medical practice and workplaces are, however it is more likely that some subsampling within clusters will be performed for the sake of efficiency. As Lumley notes, clusters in the first stage are called Primary Sampling Units or PSUs. “Stages” refer to the different levels at which sampling occurs. E.g. Sampling individuals within sampled census tracts within a state would involve sampling census tracts in the first stage and then individuals in the second stage. The diagram below communicates this idea graphically.

Sampling weights are determined assuming independence across stages — e.g. if a cluster of houses is sampled with probability \(\pi_1\) and a household is sampled within that cluster with probability \(\pi_2\) then the sampling probability for that house is \(\pi = \pi_1 \times \pi_2\) and it’s weight is the inverse of that probability. Note that this requires that clusters be mutually exclusive - a sampled unit can belong only to one cluster and no others. Further, note that we can still have biased sampling within a stage, as independence is only required across stages to use to find probabilities via their product.

Lumley goes on to describe how cluster sampling and individual sampling can be mixed since each stratum of a survey can be thought of as a separate and independent sample it is trivial to combine single stage sampling in one stratum and multistage sampling in another; a stratified random sample can be used in high density regions where measurement of multiple units is less costly and a cluster sample can be taken in low density regions where the cost of each additional unit is more costly.

The statistical rationale behind this strategy is fairly straightforward — since the variance of the sum is the sum of the variances of each stage (assuming independence) each sampled cluster in a multistage sample can be considered as a population for further sampling. Lumley uses the example of a simplified NHANES design, where 64 regions are grouped into 32 strata. A simple random sample of 440 individuals are then measured in each region. In Lumley’s words,

The variance of an estimated total from this design can be partitioned across two sources: the variance of each estimated regional total around the true total of the region and the variance that would result if the true total for each of the 64 sampled regions were known exactly.

In my own words and understanding, I understand there to be variance that comes from grouping the 64 regions into 32 strata — so there is uncertainty across region and then the uncertainty within that region that results from the sample of only a subset of the population.

Describing Multi Stage Designs to R

In order to specify a single stage cluster sample or a multistage sample treated as a single stage sample with replacement, the main difference is that the PSU identifier needs to be supplied to the id argument, as follows.

# Data originally found at
# "https://github.com/cran/LogisticDx/blob/master/data/nhanes3.rda"
names3 <- load("Data/nhanes/nhanes3.rda")
as_tibble(nhanes3)
## # A tibble: 17,030 × 16
##     SEQN SDPPSU6 SDPSTRA6 WTPFHX6 HSAGEIR HSSEX  DMARACER BMPWTLBS BMPHTIN
##    <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <fct>  <fct>       <dbl>   <dbl>
##  1     3       1       44   1735.      21 male   white        180.    70.4
##  2     4       1       43   1725.      32 female white        136.    63.9
##  3     9       2       43  19452.      48 female white        150.    61.8
##  4    10       1        6  27770.      35 male   white        204.    69.8
##  5    11       2       40   1246.      48 male   white        155.    66.2
##  6    19       1       35   3861.      44 male   black        190.    70.2
##  7    34       1       13   5032.      42 female black        126.    62.6
##  8    44       1        8  28149.      24 female white        123.    64.4
##  9    45       1       22   4582.      67 female black        150.    64.3
## 10    48       1       24  26919.      56 female white        240.    67.6
## # ℹ 17,020 more rows
## # ℹ 7 more variables: PEPMNK1R <dbl>, PEPMNK5R <dbl>, HAR1 <fct>, HAR3 <fct>,
## #   SMOKE <fct>, TCP <dbl>, HBP <fct>
svydesign(
  id = ~SDPPSU6, strat = ~SDPSTRA6,
  weight = ~WTPFHX6,
  ## nest = TRUE indicates the PSU identifier is nested
  ## within stratum - repeated across strata
  nest = TRUE,
  data = nhanes3
)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (98) clusters.
## svydesign(id = ~SDPPSU6, strat = ~SDPSTRA6, weight = ~WTPFHX6, 
##     nest = TRUE, data = nhanes3)

SDPPSU6 is the pseudo PSU variable, and SDPSTRA6 is the stratum identifier defined for the single stage analysis.

For example, a two stage design for the API population that samples 40 school districts, then five schools within each district , the design has population size 757 at the first stage for the number of school districts in CA and the number of schools within each district for the second stage. The weights need not be supplied if they can be worked out from the other arguments.

data(api)
as_tibble(apiclus2)
## # A tibble: 126 × 40
##    cds       stype name  sname  snum dname  dnum cname  cnum  flag pcttest api00
##    <chr>     <fct> <chr> <chr> <dbl> <chr> <int> <chr> <int> <int>   <int> <int>
##  1 31667796… E     Alta… Alta…  3269 Alta…    15 Plac…    30    NA     100   821
##  2 55751846… E     Tena… Tena…  5979 Big …    63 Tuol…    54    NA     100   773
##  3 41688746… E     Pano… Pano…  4958 Bris…    83 San …    40    NA      98   600
##  4 41688746… M     Lipm… Lipm…  4957 Bris…    83 San …    40    NA     100   740
##  5 41688746… E     Bris… Bris…  4956 Bris…    83 San …    40    NA      98   716
##  6 40687266… E     Cayu… Cayu…  4915 Cayu…   117 San …    39    NA     100   811
##  7 20651936… E     Full… Full…  2548 Chow…   132 Made…    19    NA     100   472
##  8 20651936… E     Fair… Fair…  2550 Chow…   132 Made…    19    NA     100   520
##  9 20651936… M     Wils… Wils…  2549 Chow…   132 Made…    19    NA     100   568
## 10 06615980… H     Colu… Colu…   348 Colu…   152 Colu…     5    NA      96   591
## # ℹ 116 more rows
## # ℹ 28 more variables: api99 <int>, target <int>, growth <int>, sch.wide <fct>,
## #   comp.imp <fct>, both <fct>, awards <fct>, meals <int>, ell <int>,
## #   yr.rnd <fct>, mobility <int>, acs.k3 <int>, acs.46 <int>, acs.core <int>,
## #   pct.resp <int>, not.hsg <int>, hsg <int>, some.col <int>, col.grad <int>,
## #   grad.sch <int>, avg.ed <dbl>, full <int>, emer <int>, enroll <int>,
## #   api.stu <int>, pw <dbl>, fpc1 <dbl>, fpc2 <int[1d]>
## dnum = district id
## snum = school id
## fpc1 = school id number
clus1_design <- svydesign(id = ~dnum, fpc = ~fpc, data = apiclus1)
clus2_design <- svydesign(
  id = ~ dnum + snum, fpc = ~ fpc1 + fpc2,
  data = apiclus2
)
clus2_design
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)

Strata with only one PSU

When only one PSU exists within a population stratum, the sampling fraction must be 100%, since otherwise it would be 0%. In this case, the stratum does not contribute to the first stage variance and it should be ignored in calculating the first stage variance. Lumley argues that the best way to handle a stratum with only one PSU is to combine it with another stratum, one that is chosen to be similar based on population data available before the study was done. The survey package has two different methods implemented to handle “lonely” PSU’s. Lumley has written further on this topic here.

How good is the single-stage approximation?

Here Lumley walks through an example detailing the trade-offs involved in using the single stage approximation. I’ll try to come up with a simulated example later as the data is not listed on the book’s website nor is it clear how to reassemble his dataset from the files at the NHIS site.

Sampling by Size

Why do white sheep eat more than black sheep? There are more white sheep than black sheep

A specific design theory, Probability-proportional-to-size (PPS), cluster sampling is a sampling strategy that exploits the fact that for a simple random sample of an unstratified population \(\pi_i\) can be chosen such that it is approximately proportional to \(X_i\), the variable of interest, the resulting variance of the estimate of the total \(V[\hat{T}] = \frac{N-n}{N} N^{2} \frac{V[X]}{n}\) can then be controlled to be quite small. These are the same ideas I discussed at the start of the notes but more discussion on this topic can be found in (Tillé 2006; Hanif and Brewer 1980).

data(election)
election <- as_tibble(election) %>%
  mutate(
    votes = Bush + Kerry + Nader,
    p = 40 * votes / sum(votes)
  )
election %>%
  ggplot(aes(x = Kerry, y = Bush)) +
  geom_point() +
  scale_y_log10() +
  scale_x_log10() +
  ggtitle("Correlation in Voting Totals from US 2004 Presidential Election",
    subtitle = "Both x and y axes are on log 10 scales."
  )

When Lumley’s book was written, only the single stage approximation of PPS could be analyzed using the survey package. A demo is shown below using the voting data, where a PPS sample is constructed and then analyzed.

data(election)
election <- as_tibble(election) %>%
  mutate(
    votes = Bush + Kerry + Nader,
    p = 40 * votes / sum(votes)
  )
election
## # A tibble: 4,600 × 8
##    County   TotPrecincts PrecinctsReporting   Bush Kerry Nader  votes       p
##    <fct>           <int>              <int>  <int> <int> <int>  <int>   <dbl>
##  1 Alaska            439                439 151876 86064  3890 241830 0.0832 
##  2 Autauga            22                 22  15212  4774    74  20060 0.00691
##  3 Baldwin            50                 50  52910 15579   371  68860 0.0237 
##  4 Barbour            24                 24   5893  4826    26  10745 0.00370
##  5 Bibb               16                 16   5471  2089    12   7572 0.00261
##  6 Blount             26                 26  17364  3932    92  21388 0.00736
##  7 Bullock            27                 27   1494  3210     3   4707 0.00162
##  8 Butler             27                 27   4978  3409    13   8400 0.00289
##  9 Calhoun            53                 53  29806 15076   182  45064 0.0155 
## 10 Chambers           24                 24   7618  5346    42  13006 0.00448
## # ℹ 4,590 more rows
insample <- sampling::UPtille(election$p)
ppsample <- election[insample == 1, ]
ppsample$wt <- 1 / ppsample$p
pps_design <- svydesign(id = ~1, weight = ~wt, data = ppsample)
pps_design
## Independent Sampling design (with replacement)
## svydesign(id = ~1, weight = ~wt, data = ppsample)
svytotal(~ Bush + Kerry + Nader, pps_design, deff = TRUE)
##          total       SE   DEff
## Bush  61837462  2905144 0.0084
## Kerry 53869608  2892749 0.0040
## Nader   492036   105989 0.0388

Loss of information from sampling clusters

The loss of precision per observation from cluster sampling is given by the design effect.

“For a single-stage cluster sample with all clusters having the same number of individuals, \(m\), the design effect is

\[ DEff = 1 + (m-1)\rho, \]

where \(\rho\) is the within-cluster correlation.

Lumley illustrates how design effects can illustrate the impact on inference using the California school data set from before as well as the Behavioral Risk Factor Surveillance System from 2007.

svymean(~ api00 + meals + ell + enroll, clus1_design, deff = TRUE)
##            mean       SE    DEff
## api00  644.1694  23.5422  9.2531
## meals   50.5355   6.2690 10.3479
## ell     27.6120   2.0193  2.6711
## enroll 549.7158  45.1914  2.7949

In the above, the variance is up to 10 times higher in the cluster sample as compared to a simple random sample.

## Lumley renames clus2_design to dclus2 from before. I maintain the same names.
svymean(~ api00 + meals + ell + enroll, clus2_design, deff = TRUE, na.rm = TRUE)
##            mean       SE    DEff
## api00  673.0943  31.0574  6.2833
## meals   52.1547  10.8368 11.8585
## ell     26.0128   5.9533  9.4751
## enroll 526.2626  80.3410  6.1427

These values increase slightly for all measures except api00 in the two stage cluster sampling design. Lumley points out that these large design effects demonstrate how variable the measures of interest are between cluster, suggesting that the sampling of clusters, while efficient economically are not as efficient statistically.

Similarly, when computing the proportion of individuals who have more than 5 servings of fruits and vegetables a day (X_FV5SRV = 2), as well as how often individuals received a cholesterol test in the past 5 years (X_CHOLCHK = 1) from the 2007 Behavioral Risk Factor Surveillance System dataset, we see design effects that reflect the geographic variability across the blocks of telephone numbers that were sampled for the survey.

brfss <- svydesign(
  id = ~X_PSU, strata = ~X_STATE, weight = ~X_FINALWT,
  data = "brfss", dbtype = "SQLite",
  dbname = "data/BRFSS/brfss07.db", nest = TRUE
)
brfss
## DB-backed Stratified Independent Sampling design (with replacement)
## svydesign(id = ~X_PSU, strata = ~X_STATE, weight = ~X_FINALWT, 
##     data = "brfss", dbtype = "SQLite", dbname = "data/BRFSS/brfss07.db", 
##     nest = TRUE)
food_labels <- c("Yes Veg", "No Veg")
chol_labels <- c("within 5 years", ">5 years", "never")
svymean(
  ~ factor(X_FV5SRV) +
    factor(X_CHOLCHK),
  brfss,
  deff = TRUE
)
##                          mean         SE   DEff
## factor(X_FV5SRV)1  0.73096960 0.00153359 5.1632
## factor(X_FV5SRV)2  0.23844253 0.00145234 5.0147
## factor(X_FV5SRV)9  0.03058787 0.00069991 7.1323
## factor(X_CHOLCHK)1 0.73870300 0.00168562 6.3550
## factor(X_CHOLCHK)2 0.03230828 0.00058759 4.7676
## factor(X_CHOLCHK)3 0.19989559 0.00162088 7.0918
## factor(X_CHOLCHK)9 0.02909313 0.00055471 4.7029

Repeated Measurements

Lumley notes that design based inference continues to differ from model based in its analysis of repeated measurements. Where model based inference is careful to account for modeling the – for example – within person or within household correlation in a cohort study, no such adjustment is required in a designed survey other than adjusting and using the appropriate weights - treating the repeated measurement like another stage of clustering in the sampling.

Lumley illustrates this with the Survey of Income and Program Participation (SIPP) panel survey.

Each panel is followed for multiple years, with subsets of the panel participating in four month waves of follow-up… wave 1 of the 1996 panel, which followed 36,730 households with interviews every four months, starting in late 1995 or early 1996… The households were recruited in a two-stage sample. The first stage sampled 322 counties or groups of counties as PSUs; the second stage sampled households within these PSUs.

Lumley demonstrates how to estimate repeated measures with panel data using the survey package via the code below. 5 quantiles are estimated across the population and across the 8 months. When Lumley mentions that there is no need for adjusting for correlation in the blockquote above, I believe he is referring to the within-month point estimates. If we were to try and estimate the change in, say, income as a function of other covariates I believe we would want to adjust for correlation in order to get the appropriate standard errors. For the point estimates Lumley points out that the weights are exactly as required for the per-month estimate, but would need to be divided by the number of samples when totaling across the number of measurements. Proportions or regressions are invariant to this scaling factor so no adjustment is needed there.

sipp_hh <- svydesign(
  id = ~ghlfsam, strata = ~gvarstr, nest = TRUE,
  weight = ~whfnwgt, data = "household", dbtype = "SQLite",
  dbname = "Data/SIPP/sipp.db"
)
sipp_hh <- update(sipp_hh,
  month = factor(rhcalmn,
    levels = c(12, 1, 2, 3, 4, 5, 6),
    labels = c(
      "Dec", "Jan", "Feb", "Mar",
      "Apr", "May", "Jun"
    )
  )
)
qinc <- svyby(~thtotinc, ~month, svyquantile,
  design = sipp_hh,
  quantiles = c(0.25, 0.5, 0.75, 0.9, 0.95), se = TRUE
)
pltdf <- as_tibble(qinc) %>%
  select(month, contains("thtotinc"), -contains("se")) %>%
  gather(everything(), -month, key = "quantile", value = "Total Income") %>%
  mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)

se <- as_tibble(qinc) %>%
  select(month, contains("se")) %>%
  gather(everything(), -month, key = "quantile", value = "SE") %>%
  mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)

pltdf <- pltdf %>%
  left_join(se) %>%
  mutate(
    lower = `Total Income` - 2 * SE,
    upper = `Total Income` + 2 * SE
  )
pltdf %>%
  mutate(quantile = factor(quantile)) %>%
  ggplot(aes(x = month, y = `Total Income`, color = quantile)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  xlab("Month in 1995/1996") +
  ylab("Total Income (USD)") +
  ggtitle("Total Income Quantiles",
    subtitle = "Survey of Income and Program Participation"
  )

Lumley notes here that we’re only talking about correlation at all here because the same individuals are being measured across time. In a study like NHANES that recruits different individuals each year, we can effectively assume the samples are independent since the overall population si so big.

Estimation would be much more complicated if the samples overlapped in some way.

Questions

  1. The web site has data files demo_x.xpt of demographic data and bpx_c.xpt of blood pressure data from NHANES 2003-2004. Code to load and merge these data sets is in Appendix B, in Figure B.1
  1. Construct a survey design object with these data.
## data are from the book website:
# https://r-survey.r-forge.r-project.org/svybook/
# demographic data
ddf <- haven::read_xpt("data/nhanesxpt/demo_c.xpt")
# blood pressure data
bpdf <- haven::read_xpt("data/nhanesxpt/bpx_c.xpt")

bpdf <- merge(ddf, bpdf, by = "SEQN", all = FALSE) %>%
  mutate(
    sys_over_140 = (BPXSAR > 140) * 1,
    dia_over_90 = (BPXDAR > 90) * 1,
    hbp = (sys_over_140 + dia_over_90 == 2) * 1,
    age_group = cut(RIDAGEYR, c(0, 21, 65, Inf)),
    sex = factor(RIAGENDR, levels = 1:2, labels = c("Male", "Female"))
  )

bpdf_design <- bpdf %>%
  select(
    sys_over_140, dia_over_90, hbp, WTMEC2YR, SDMVPSU, SDMVSTRA,
    age_group, sex, RIDAGEYR, BPXSAR, BPXDAR, RIAGENDR, RIDAGEMN
  ) %>%
  as_survey_design(
    weights = WTMEC2YR,
    id = SDMVPSU,
    strata = SDMVSTRA,
    nest = TRUE
  )
bpdf_design
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## Called via srvyr
## Sampling variables:
##  - ids: SDMVPSU
##  - strata: SDMVSTRA
##  - weights: WTMEC2YR
## Data variables: sys_over_140 (dbl), dia_over_90 (dbl), hbp (dbl), WTMEC2YR
##   (dbl), SDMVPSU (dbl), SDMVSTRA (dbl), age_group (fct), sex (fct), RIDAGEYR
##   (dbl), BPXSAR (dbl), BPXDAR (dbl), RIAGENDR (dbl), RIDAGEMN (dbl)
  1. Estimate the proportion of the US population with systolic blood pressure over 140 mm HG, with diastolic blood pressure over 90 mm Hg, and with both.

Lumley doesn’t tell us which variable name corresponds to the measurements indicated and I didn’t see any documentation in the files included so I just guess here.

bpdf_design %>%
  summarize(
    prop_hbp_one =
      survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
    prop_hbp_two =
      survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
    prob_hbp = survey_ratio(hbp, n(), na.rm = TRUE, vartype = "ci", deff = TRUE)
  ) %>%
  gather(everything(), key = "metric", value = "value")
## # A tibble: 12 × 2
##    metric                 value
##    <chr>                  <dbl>
##  1 prop_hbp_one      0.0000119 
##  2 prop_hbp_one_low  0.0000104 
##  3 prop_hbp_one_upp  0.0000135 
##  4 prop_hbp_one_deff 3.60      
##  5 prop_hbp_two      0.0000119 
##  6 prop_hbp_two_low  0.0000104 
##  7 prop_hbp_two_upp  0.0000135 
##  8 prop_hbp_two_deff 3.60      
##  9 prob_hbp          0.00000217
## 10 prob_hbp_low      0.00000141
## 11 prob_hbp_upp      0.00000293
## 12 prob_hbp_deff     4.15
  1. Estimate the design effects for these proportions.

Included in the output above.

  1. How do these proportions vary by age (RIDAGEYR) and gender (RIAGENDR)
nhanes_ests <- bpdf_design %>%
  group_by(sex, age_group) %>%
  summarize(
    prop_hbp_one =
      survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
    prop_hbp_two =
      survey_ratio(sys_over_140, n(), na.rm = TRUE, vartype = "ci", deff = TRUE),
    prob_hbp = survey_ratio(hbp, n(), na.rm = TRUE, vartype = "ci", deff = TRUE)
  ) %>%
  ungroup()
nhanes_ests
## # A tibble: 8 × 14
##   sex    age_group  prop_hbp_one prop_hbp_one_low prop_hbp_one_upp
##   <fct>  <fct>             <dbl>            <dbl>            <dbl>
## 1 Male   (0,21]      0.000000953     -0.000000551       0.00000246
## 2 Male   (21,65]     0.0000647        0.0000486         0.0000809 
## 3 Male   (65,Inf]    0.000478         0.000402          0.000554  
## 4 Male   <NA>      NaN              NaN               NaN         
## 5 Female (0,21]      0.000000660     -0.000000521       0.00000184
## 6 Female (21,65]     0.0000597        0.0000518         0.0000677 
## 7 Female (65,Inf]    0.000738         0.000655          0.000820  
## 8 Female <NA>      NaN              NaN               NaN         
## # ℹ 9 more variables: prop_hbp_one_deff <dbl>, prop_hbp_two <dbl>,
## #   prop_hbp_two_low <dbl>, prop_hbp_two_upp <dbl>, prop_hbp_two_deff <dbl>,
## #   prob_hbp <dbl>, prob_hbp_low <dbl>, prob_hbp_upp <dbl>, prob_hbp_deff <dbl>

This produces a bunch of output, so I plot the high blood pressure result — both high systolic and diastolic blood pressure – below.

nhanes_ests %>%
  select(age_group, sex, prob_hbp, prob_hbp_low, prob_hbp_upp) %>%
  filter(!is.na(age_group)) %>%
  ggplot(aes(x = age_group, y = prob_hbp, color = sex)) +
  geom_pointrange(aes(ymin = prob_hbp_low, ymax = prob_hbp_upp),
    position = position_dodge(width = .25)
  ) +
  ylab("% US Population with High Blood Pressure") +
  xlab("Age Group") +
  ggtitle("Prevalence of High Blood Pressure in U.S. Across Age and Sex",
    subtitle = "Data from NHANES 2003-2004"
  ) +
  scale_y_continuous(labels = scales::percent)

3.2 Repeat the sampling and estimation in Figure 3.4 1000 times.

FYI the sampling::UPtille function takes a while so running this simulation 1000 times takes a while.

OneSimulation <- function() {
  insample <- sampling::UPtille(election$p)
  ppsample <- election[insample == 1, ]
  ppsample$wt <- 1 / ppsample$p
  pps_design <- svydesign(id = ~1, weight = ~wt, data = ppsample, pps = "brewer")
  total <- svytotal(~ Bush + Kerry + Nader, pps_design)
  std_err <- sqrt(diag(attr(total, "var")))
  ci <- confint(total)
  out <- cbind(total, std_err, ci)
}
total_sims <- replicate(10, OneSimulation())
  1. Check that the mean of the estimated totals is close to the true population totals.
estimated_totals <- as_tibble(t(rowMeans(total_sims[, 1, ]))) %>%
  mutate(label = "Estimate")

true_totals <- as_tibble(t(colSums(election[, c("Bush", "Kerry", "Nader")]))) %>%
  mutate(label = "Truth")

rbind(estimated_totals, true_totals) %>%
  gather(everything(), -label, key = "Candidate", value = "Vote Count") %>%
  spread(label, `Vote Count`) %>%
  gt::gt() %>%
  gt::fmt_scientific() %>%
  gt::tab_header("Simulated Total Comparison")
Simulated Total Comparison
Candidate Estimate Truth
Bush 5.97 × 107 5.96 × 107
Kerry 5.61 × 107 5.61 × 107
Nader 4.18 × 105 4.04 × 105

These are quite close.

b). Compute the mean of the estimated standard errors and compare it to the true simulation standard error, that is, the standard deviation of the estimated totals.

estimated_stderrs <- as_tibble(t(rowMeans(total_sims[, 2, ]))) %>%
  mutate(label = "Estimate")

true_stderr <- as_tibble(apply(total_sims, 1, sd)) %>%
  mutate(label = "Truth", Candidate = c("Bush","Kerry", "Nader")) %>% 
  spread(Candidate,value)

rbind(estimated_stderrs, true_stderr) %>%
  gather(everything(), -label, key = "Candidate", value = "Vote Std.Err") %>%
  spread(label, `Vote Std.Err`) %>%
  gt::gt() %>%
  gt::fmt_scientific() %>%
  gt::tab_header("Simulated Standard Error Comparison")

Same as above.

  1. Estimate 95% confidence intervals for the population totals and compute the proportion of intervals that contain the true population value.
totals <- unlist(true_totals[, 1:3])
prop_contained <- rowMeans((total_sims[, 3, ] < totals &
  total_sims[, 4, ] > totals) * 1)
prop_contained

We see that the proportion contained by the interval is near the 95% nominal value.

3.3 The National Longitudinal Study of Youth is documented at www.nlsinfo.org/nlsy97/nlsdocs/nlsy97/maintoc.html

  1. What are the stages of sampling and the sampling units at each stage?

From the website the sampling occurred in 2 phases (see image below).

The first phase screened households and the second identified eligible respondents. I think this would technically be called a two phase sample (discussed in chapter 8) but given the population sizes involved it may be that there was no need to account for the potential dependence here.

The sampling unit in the first stage came from NORC’s 1990 sampling design which used Standard Metropolitan Statistical Areas or non-metropolitan counties. That is, according to the general social survey documentation

Subsequent sampling units are segments, households, and then members of households.

  1. What would the strata and PSUs be for the single stage approximation to the design?

I don’t know. It isn’t clear how to reconstruct Lumley’s example given that the data is not available. Looking at the NHIS website it looks like the strata and PSU information for a single stage approximation are given. When looking for the same information for the NLYS data I don’t see equivalent instructions on how to construct a single stage approximation. His description in the book here isn’t very forthcoming either but I’d guess that one could just concatenate the PSU and SSU labels and the same for the strata and then use those in a one stage design. This doesn’t look exactly like what Lumley does in his example.

3.4 The British Household Panel Survey (BHPS) is documented at http://www.iser.essex.ac.uk/survey/bhps (this website is no longer accurate) .

  1. What are the stages of sampling, strata and the sampling units at each stage?

The website given is no longer accurate. Clicking through the documentation of the “Understanding Society” website it looks like the BHPS has been combined with other surveys. When I look at more documentation, (1, 2, 3), the last of these being the most pertinent, I see a variety of designs by wave. It looks like the design was a two stage stratified sample where the sampling frame was the Postcode Address File for Great Britian, excluding Northern Ireland. 250 postcodes were chosen as the primary sampling unit, in the second stage, “delivery points” which are roughly equivalent to addresses were then selected.

  1. What would the strata and PSUs be for the single stage approximation to the design?

Again, my best guess is to concatenate the postcodes and delivery addresses to construct the single stage identifiers, and the same for the strata.

3.5 Statistics New Zealand lists survey topics at http://www.stats.govt.nz/datasets/a-z-list.htm . Find the Household Labour Force Survey.

  1. What are the stages of sampling and the sampling units at each stage?

Again, that site is no longer valid. The current help page for the household labor survey simply says,

Approximately fifteen thousand (15,000) households take part in this survey every three months. A house is selected using a fair statistical method to ensure the sample is an accurate representation of New Zealand. Every person aged 15 years or over living in a selected house needs to take part.

A different doc I found on google searching listed the 2016 survey design as a 2 stage design where samples are drawn from strata in the first stage. The PSU’s are “meshblocks” which appear to be the NZ equivalent of a U.S. census block / tract.

  1. What would the strata and PSUs be for the single stage approximation to the design?

Same answer here as previously.

3.6 This exercise uses the Washington State Crime data for 2004 as the population. The data consists of crime rates and population size for the police districts (in cities/towns) and sheriffs’ offices (in unincorporated areas), grouped by county.

I took the data from this website. Specifically, this excel sheet. One of the tricky things about using this data was the fact that several police districts are reported as having zero associated population. I removed these from the dataset to make things simpler.

  1. Take a simple random sample of 10 counties from the state and use all the data from the sampled counties. Estimate the total number of murders and burglaries in the state.
# data from
wa_crime_df <- readxl::read_xlsx("data/WA_crime/1984-2011.xlsx", skip = 4) %>%
  filter(Year == "2004", Population > 0) %>%
  mutate(
    murder_and_crime = `Murder Total` + `Burglary Total`,
    violent_crime = `Violent Crime Total`,
    property_crime = `Property Crime Total`,
    state_pop = sum(Population),
    County = stringr::str_to_lower(County),
    num_counties = n_distinct(County),
  ) %>%
  group_by(County) %>%
  mutate(num_agencies = n_distinct(Agency)) %>%
  ungroup() %>%
  select(
    County, Agency, Population, murder_and_crime, 
    violent_crime, property_crime,
    num_counties, num_agencies
  )

true_count <- sum(wa_crime_df$murder_and_crime)
county_list <- unique(wa_crime_df$County)
county_sample <- sample(county_list, 10)
part_a <- wa_crime_df %>%
  filter(County %in% county_sample) %>%
  as_survey_design(
    ids = c(County, Agency),
    fpc = c(num_counties, num_agencies)
  ) %>%
  summarize(total = survey_total(murder_and_crime)) %>%
  mutate(Q = "a")
part_a
## # A tibble: 1 × 3
##    total total_se Q    
##    <dbl>    <dbl> <chr>
## 1 139230   61180. a
  1. Stratify the sampling so that King County is sampled with 100% probability together with a simple random sample of five other counties. Estimate the total number of of murders and burglaries in the state and compare to the previous estimates.
county_sample <- sample(county_list[-which(county_list == "king")], 5)
county_sample <- c(county_sample, "king")
part_b <- wa_crime_df %>%
  filter(County %in% county_sample) %>%
  mutate(
    strata_label = if_else(County == "king", "strata 1", "strata 2"),
    num_counties = if_else(County == "king", 1, length(county_list) - 1)
  ) %>%
  as_survey_design(
    ids = c(County, Agency),
    fpc = c(num_counties, num_agencies),
    strata = strata_label
  ) %>%
  summarize(
    total = survey_total(murder_and_crime)
  ) %>%
  mutate(Q = "b")
part_b
## # A tibble: 1 × 3
##   total total_se Q    
##   <dbl>    <dbl> <chr>
## 1 36548   11157. b
  1. Take simple random samples of five police districts from King County and five counties from the rest of the state. Estimate the total number of murders and burglaries in the state and compare to the previous estimates.

This is a stratified two-stage sample design with no uncertainty in the first stage in one (king county) strata, and no uncertainty in the second stage in the other (non-king counties) strata.

king_districts <- wa_crime_df %>%
  filter(County == "king") %>%
  pull(Agency)
sampled_king_districts <- sample(king_districts, 5)
sampled_counties <- sample(county_list, 5)

part_c <- wa_crime_df %>%
  filter(County %in% sampled_counties | Agency %in% sampled_king_districts) %>%
  mutate(
    strata_label = if_else(County == "king", "King County", "WA Counties"),
    num_counties = if_else(County == "king", 1, length(county_list) - 1),
  ) %>%
  as_survey_design(
    id = c(County, Agency),
    fpc = c(num_counties, num_agencies),
    strata = strata_label
  ) %>%
  summarize(total = survey_total(murder_and_crime, vartype = "se")) %>%
  mutate(Q = "c")
part_c
## # A tibble: 1 × 3
##   total total_se Q    
##   <dbl>    <dbl> <chr>
## 1 62090   34682. c
  1. Take a probability proportional to size (PPS) sample of 10 police districts. Estimate the total number of murders and burglaries in the state and compare to the previous estimates.
pi <- sampling::inclusionprobabilities(a = wa_crime_df$Population, n = 10)
part_d <- wa_crime_df %>%
  mutate(
    pi = pi,
    in_sample = sampling::UPbrewer(pi)
  ) %>%
  filter(in_sample == 1) %>%
  as_survey_design(probs = pi) %>%
  summarize(total = survey_total(murder_and_crime)) %>%
  mutate(Q = "d")
part_d
## # A tibble: 1 × 3
##    total total_se Q    
##    <dbl>    <dbl> <chr>
## 1 52879.    6179. d

This has the lowest standard error yet, likely since the sampling was specifically designed to capture high density districts.

  1. Take a simple random sample of counties and include all the police districts. Estimate the total number of murders and burglaries in the state and compare to the previous estimates.
county_sample <- sample(county_list, 10)
part_e <- wa_crime_df %>%
  filter(County %in% county_sample) %>%
  as_survey_design(
    ids = c(County, Agency),
    fpc = c(num_counties, num_agencies)
  ) %>%
  summarize(total = survey_total(murder_and_crime)) %>%
  mutate(Q = "e")
part_e
## # A tibble: 1 × 3
##    total total_se Q    
##    <dbl>    <dbl> <chr>
## 1 34414.    9809. e

I’ll compare the standard error of all the estimates now.

rbind(part_a, part_b, part_c, part_d, part_e) %>%
  select(Q, total_se) %>%
  gt::gt()
Q total_se
a 61179.512
b 11156.923
c 34682.234
d 6178.739
e 9809.294

We see the highest variance in estimates from the simple random samples - Q’s a) and e). We see the lowest variance from part d) which used the proportional to size sampling scheme. Finally, we see the two stratified multi-stage samples have variances in between, all of which makes sense.

3.7 Use the household data from the 1966 SIP panel to estimate the 25th, 50th, 75th, 90th and 95th percentiles of income for households of different sizes (ehhnumpp) averaged over the fourth months. You will want to recode the large values of ehhnumpp to a a single category. Describe the patterns you see.

For this question we effectively copy the code from the “Repeated Measures” section, looking at quantile by a recoded household size now, instead of month.

sipp_hh <- update(sipp_hh,
  household_size = factor(
    case_when(
      ehhnumpp <= 8 ~ as.character(ehhnumpp),
      TRUE ~ ">=9"
    ),
    levels = c(as.character(1:8), ">=9")
  )
)
qinc <- svyby(~thtotinc, ~household_size, svyquantile,
  design = sipp_hh,
  quantiles = c(0.25, 0.5, 0.75, 0.9, 0.95), se = TRUE
)
pltdf <- as_tibble(qinc) %>%
  select(household_size, contains("thtotinc"), -contains("se.")) %>%
  gather(everything(), -household_size, key = "quantile", value = "Total Income") %>%
  mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)

se <- as_tibble(qinc) %>%
  select(household_size, contains("se.")) %>%
  gather(everything(), -household_size, key = "quantile", value = "SE") %>%
  mutate(quantile = as.numeric(str_extract(quantile, "[0-9].[0-9]?[0-9]")) * 100)

pltdf <- pltdf %>%
  left_join(se) %>%
  mutate(
    lower = `Total Income` - 2 * SE,
    upper = `Total Income` + 2 * SE
  )

pltdf %>%
  mutate(quantile = factor(quantile)) %>%
  ggplot(aes(x = household_size, y = `Total Income`, color = quantile)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  xlab("Household Size") +
  ylab("Total Income (USD)") +
  ggtitle("Total Income Quantiles",
    subtitle = "Survey of Income and Program Participation"
  )

In the plot above we can see that the income variability gets wider as household size increases but then plateaus at around ~ 5. Additionally, there are few samples with very large households so estimating the quantiles for those groups is increasingly noisy.

3.8 In the data from the 1996 SIPP panel a) What proportion of households received any “means-tested cash benefits” (thtrninc)? For those households that did receive benefits what mean proportion of their total income came from these benefits?

db <- DBI::dbConnect(RSQLite::SQLite(), "Data/SIPP/sipp.db")
sipp_df <- tbl(db, sql("SELECT * FROM household")) %>%
  dplyr::collect() %>%
  select(ghlfsam, gvarstr, whfnwgt, thtotinc, thtrninc, tmthrnt) %>%
  mutate(
    benefit_recipient = I(thtrninc > 0) * 1,
    thtrninc = as.numeric(thtrninc),
    tmthrnt = as.numeric(tmthrnt)
  )

sipp_hh_sub <- sipp_df %>%
  as_survey_design(
    id = ghlfsam, strata = gvarstr, nest = TRUE,
    weight = whfnwgt
  )
sipp_hh_sub %>%
  summarize(prop_benefit_recipients = survey_mean(benefit_recipient))
## # A tibble: 1 × 2
##   prop_benefit_recipients prop_benefit_recipients_se
##                     <dbl>                      <dbl>
## 1                  0.0829                    0.00174
sipp_hh_sub %>%
  filter(benefit_recipient == 1) %>%
  mutate(prop_benefit_income = thtrninc / thtotinc) %>%
  summarize(
    mn_prop_benefit_income = survey_mean(prop_benefit_income)
  )
## # A tibble: 1 × 2
##   mn_prop_benefit_income mn_prop_benefit_income_se
##                    <dbl>                     <dbl>
## 1                  0.497                   0.00710
  1. What proportion of households paid rent tmthrnt? What were the mean and the 75th and 95th percentiles of the proportion of monthly income paid in rent? What proportion paid more than one third of their income in rent?
sipp_hh_sub %>%
  mutate(
    pays_rent = I(as.numeric(tmthrnt) > 0) * 1,
    pct_income_rent = (tmthrnt / thtotinc) * 100,
    rent_pct_gt_thrd = (pct_income_rent > 0.33) * 1
  ) %>%
  # avoid divide by zero error
  filter(thtotinc >0) %>% 
  summarize(
    prop_pays_rent = survey_mean(pays_rent),
    mean_pct_income_rent = survey_mean(pct_income_rent, na.rm = TRUE),
    quantile_pct_income_rent = survey_quantile(pct_income_rent, 
                                                   na.rm = TRUE, 
                                                   quantiles = c(.75,.95)),
    prop_rent_gt_thrd = survey_mean(rent_pct_gt_thrd, na.rm = TRUE)
  ) %>% 
  pivot_longer(cols = everything(), names_to = "Rent", values_to = "Estimates")
## # A tibble: 10 × 2
##    Rent                            Estimates
##    <chr>                               <dbl>
##  1 prop_pays_rent                    0.0401 
##  2 prop_pays_rent_se                 0.00111
##  3 mean_pct_income_rent              2.24   
##  4 mean_pct_income_rent_se           0.633  
##  5 quantile_pct_income_rent_q75      0      
##  6 quantile_pct_income_rent_q95      0      
##  7 quantile_pct_income_rent_q75_se   1.42   
##  8 quantile_pct_income_rent_q95_se   1.42   
##  9 prop_rent_gt_thrd                 0.0400 
## 10 prop_rent_gt_thrd_se              0.00111

3.9 By the time you read this, the survey package is likely to provide some approximations for PPS designs that use the finite population size information. Repeat 3.2 using these.

OneSimulation <- function() {
  insample <- sampling::UPtille(election$p)
  ppsample <- election[insample == 1, ]
  ppsample$wt <- 1 / ppsample$p
  pps_design <- svydesign(id = ~1, weight = ~wt, data = ppsample, pps = "brewer")
  total <- svytotal(~ Bush + Kerry + Nader, pps_design)
  std_err <- sqrt(diag(attr(total, "var")))
  ci <- confint(total)
  out <- cbind(total, std_err, ci)
}
total_sims <- replicate(10, OneSimulation())
  1. Check that the mean of the estimated totals is close to the true population totals.
OneSimulation <- function() {
  insample <- sampling::UPtille(election$p)
  ppsample <- election[insample == 1, ]
  ppsample$wt <- 1 / ppsample$p
  ppsample$fpc <- 40 / sum(election$votes)
  pps_design <- svydesign(id = ~1, weight = ~wt, fpc = ~fpc, data = ppsample, pps = "brewer")
  total <- svytotal(~ Bush + Kerry + Nader, pps_design)
  std_err <- sqrt(diag(attr(total, "var")))
  ci <- confint(total)
  out <- cbind(total, std_err, ci)
}
total_sims <- replicate(10, OneSimulation())
estimated_totals <- as_tibble(t(rowMeans(total_sims[, 1, ]))) %>%
  mutate(label = "Estimate")

true_totals <- as_tibble(t(colSums(election[, c("Bush", "Kerry", "Nader")]))) %>%
  mutate(label = "Truth")

rbind(estimated_totals, true_totals) %>%
  gather(everything(), -label, key = "Candidate", value = "Vote Count") %>%
  spread(label, `Vote Count`) %>%
  gt::gt() %>%
  gt::fmt_scientific() %>%
  gt::tab_header("Simulated Total Comparison")
Simulated Total Comparison
Candidate Estimate Truth
Bush 6.01 × 107 5.96 × 107
Kerry 5.57 × 107 5.61 × 107
Nader 4.11 × 105 4.04 × 105

These are quite close.

b). Compute the mean of the estimated standard errors and compare it to the true simulation standard error, that is, the standard deviation of the estimated totals.

estimated_stderrs <- as_tibble(t(rowMeans(total_sims[, 2, ]))) %>%
  mutate(label = "Estimate")

true_stderr <- as_tibble(apply(total_sims, 1, sd)) %>%
  mutate(label = "Truth", Candidate = c("Bush","Kerry", "Nader")) %>% 
  spread(Candidate,value)

rbind(estimated_stderrs, true_stderr) %>%
  gather(everything(), -label, key = "Candidate", value = "Vote Std.Err") %>%
  spread(label, `Vote Std.Err`) %>%
  gt::gt() %>%
  gt::fmt_scientific() %>%
  gt::tab_header("Simulated Standard Error Comparison")

Estimate 95% confidence intervals for the population totals and compute the proportion of intervals that contain the true population value.

totals <- unlist(true_totals[, 1:3])
prop_contained <- rowMeans((total_sims[, 3, ] < totals &
  total_sims[, 4, ] > totals) * 1)
prop_contained

3.10 Repeat 3.2 computing the Hartley-Rao approximation and the full Horvitz-Thompson estimate using the code and joint-probability data on the web site. Compare the standard errors from these two approximations to the standard errors from the single-stage with replacement approxmiation and to the true simulation standard error.

3.11 Since 1999, NHANES has been conducting surveys continuously with data released on a 2 year cycle. Each data set includes a weight variable for analyzing the two-year data; the weights add to the size of the US adult, civilian, non-institutionalized population.

  1. What weight would be appropriate for estimating the number of diabetics in the population combining data from two two-year data sets?
  • Assuming these are non-overlapping samples — which seems reasonable — and considering the target population we’re generalizing to to be roughly constant over the 4 years of interest, we could divide each of the previous weights in half. This would give us twice the sample size to estimate the same target estimand.
  1. What weight would be appropriate if three two-year data sets were used?

  2. What weights would be appropriate when estimating changes, comparing the combined 1999-2000 and 2001-2002 data with the combined 2003-2004 and 2005-2006 data?

  3. How would the answers differ if the goal was to estimate a population proportion rather than a total?

Chapter 4: Graphics

Lumley advocates for three principles in visualizing survey data:

  1. Base the graph on an estimated population distribution.

  2. Explicitly indicate weights on the graph.

  3. Draw a simple random sample from the estimated population distribution and graph this sample instead.

All three of these strategies are meant to counteract the difficulty in visualizing survey data — the data available do not represent the population of interest without re-weighting.

While these principles are great and worth following, I found Lumley’s demonstrations in this chapter a little out of date. Thanks to the ggplot family of libraries and shiny there’s been huge advances in visualizing and interacting with data of all types – survey included. What’s more, it isn’t always clear how he produces the charts in the chapter. Consequently I’ve done my best to reproduce the same charts, or a chart I’d consider appropriate for the question / survey at hand in the following charts.

TODO: Talk about ggsurvey.

Plotting a Table

Lumley recommends using bar charts, forest plots and fourfold plots to visualize the data from a table. I would agree that all of these are “fine” in a certain context but strongly prefer the forest plot and variations on it personally, since it does more than all the other to include representations of uncertainty about the tabulated estimates.

medians %>%
  ggplot(aes(x = racehpr, y = Median_BMI, fill = srsex)) +
  geom_bar(stat = "identity", position = "dodge") +
  ylab("Median BMI (kg/m^2)") +
  xlab("") +
  theme(
    legend.title = element_blank(),
    axis.text.x = element_text(angle = 45, vjust = 0.25, hjust = .25)
  ) +
  ggtitle("Median BMI Across Race, Sex")

medians %>%
  ggplot(aes(x = Median_BMI, y = racehpr, color = srsex)) +
  geom_pointrange(aes(xmin = Median_BMI_low, xmax = Median_BMI_upp),
    position = position_dodge(width = .25)
  ) +
  geom_vline(aes(xintercept = 25), linetype = 2, color = "red") +
  xlab("BMI") +
  ylab("") +
  theme(legend.title = element_blank(), legend.position = "top") +
  ggtitle("Median BMI by Race/Ethnicity and gender, from CHIS") +
  labs(caption = str_c(
    "Line indicates CDC border between healthy and ",
    "overweight BMI."
  ))

One Continuous Variable

Lumley argues that the most straightforward graphical displays for a single continuous variable are boxplots, cumulative distribution function (cdf) plots and survival curves. I think histograms are also useful but we’ll get to that shortly.

Below I reproduce Figure 4.6 from the text in which Lumley demonstrates the difference between the weighted and unweighted CDF’s of the California school data demonstrating that, indeed, the design based estimate is closer to the population value than the unweighted, naive estimate of the stratified design. This should all intuitively make sense. My only complaint would be that the uncertainty of the cdf isn’t visualized when that should be just as easily accessible. I made a cursory investigation to see if these data were available in the svycdf object but they were not.

cdf.est <- svycdf(~enroll + api00 + api99, dstrata)

cdf.pop <- ecdf(apipop$enroll)

cdf.samp <- ecdf(apistrat$enroll)

par(mar = c(4.1, 4.1, 1.1, 2.1))

plot(cdf.pop, do.points = FALSE, xlab = "enrollment",
     ylab = "Cumulative probability", 
     main = "Cumulative Distribution of California School Size",
     sub = "Reproduces Lumley Fig 4.6", lwd = 1)

lines(cdf.samp, do.points = FALSE, lwd = 2)
lines(cdf.est[[1]], lwd = 2, col.vert = "grey60", col.hor = "grey60",
      do.points = FALSE)
legend("bottomright", lwd = c(1,2,2), bty = "n", 
       col = c("black", "grey60", "black"), 
       legend = c("Population", "Weighted estimate", "Unweighted Estimate"))

Next Lumley plots an adjusted kaplan meier curve using svykm() on the National Wilms Tumor Study data. Unfortunately he doesn’t show how to create the design object and I find no mention of it elsewhere in the book. I made a guess and it seems to be accurate. Note that I obtained the national wilms tumor study data from the survival package.

dcchs <- svydesign(ids = ~1, data = nwtco)
scurves <- svykm(Surv(edrel / 365.25, rel) ~ histol, dcchs)
plot(scurves)

The interpretation of this graph is that relapse (into cancer) appears to occur within about three years, if at all. The top line shows the survival for those with a favorable histology classification with an even better survival rate.

Lumley makes a note here describing how the sampling weights make a large difference in the estimate because the design is a case cohort sample. That being said, I don’t really see the difference when I compare his figure to mine created with an equal probability sample… An error here perhaps.

Boxplots

Lumley describes the typical boxplots as

based on the quartiles of the data: the box shows the median and first and third quartiles, the whiskers extend out to the last observation within 1.5 interquartile ranges of the box ends, and all points beyond the whiskers are shown individually.

This is as good a summary as any, I’ll only note that the median is the line within the interior of the box and the first and third quartile define the boundaries of the box. It’s (somewhat) easy to image how Lumley estimates these values, as he has already introduced the svyquantile() function — this should be a straightforward application of that function.

I reproduce Figure 4.10 from the text using the nhanes data and design object created in answering the questions for chapter 3. His code is equivalent.

## Lumley's code
#nhanes <- svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA,
#                    weights = ~WTMEC2YR, data = both, nest = TRUE)

## I use the bpdf design from the chapter 3 questions
nhanes <- bpdf_design %>% 
  mutate(
    age_group = cut(RIDAGEYR, c(0, 20, 30, 40, 50, 60, 80, Inf)),
  )
svyboxplot(BPXSAR ~ age_group, subset(nhanes, BPXSAR > 0), col = "gray80",
           varwidth = TRUE, ylab = "Systolic BP (mmHg)", xlab = "Age")

The plot shows the distribution in blood pressure across the different age groups. I already showed how I visualized this previously, but it is nice to know how to do something multiple ways. In general, our observation shows an increase in systolic blood pressure medians and third quartiles across age.

Graphs based on the density

Histograms and kernel density estimators both visualize the probability distribution function(pdf) of a given dataset. Confusingly, perhaps, there is no pdf to estimate in finite population inference because the sampling is random instead of the data. Still, the analogous effort can be made where a given proportion is estimated for some number of bins of the variable of interest. Instead of estimating a naive proportion, there’s now a sample based proportion. I reproduce Lumley’s Figure 4.12 below demonstrating how this works.

svyhist(~BPXSAR, subset(nhanes, RIDAGEYR > 20 & BPXSAR > 0), main = "",
        col="grey80", xlab="Systolic BP (mmHg)")
lines(svysmooth(~BPXSAR,  nhanes, bandwidth = 5,
                subset = subset(nhanes, RIDAGEYR > 20 & BPXSAR > 0)), lwd = 2)

This looks considerably different then the naive estimate, again showcasing the need for taking the design into account.

hist(bpdf_design$variables$BPXSAR, xlab = "Systolic BP (mmHg)",
     main = "Naive Distribution of Blood Pressure")

Two Continuous Variables

Where the design could be used to adjust the raw sample summaries to adjust for the design with one variable, two variable visualizations make this much more difficult as there is no way to incorporate the design information into the presented estimate. An additional dimension has to be used to illustrate the design information.

Scatterplots

To that end, Lumley uses a bubbleplot with the size of the “bubbles” proportional to the sampling weight. This way, the viewer can identify which points should be considered as “more” or less impactful in terms of representing the population. I reproduce part of Figures 4.13 - 4.15 from the text below, for those plots in which Lumley includes the code. plots the systolic and diastolic blood pressure from the NHANES design.

#TODO(petersonadam): Try plotting the api data here too
par(mfrow = c(1, 2))
svyplot(BPXSAR ~ BPXDAR, design = nhanes, style = "bubble", 
        xlab = "Diastolic pressure (mmHg)",
        ylab = "Systolic pressure (mmHg)")
svyplot(BPXSAR ~ BPXDAR, design = nhanes, style = "transparent",
        pch = 19, alpha = c(0, 0.5), 
        xlab = "Diastolic pressure (mmHg)",
        ylab = "Systolic pressure (mmHg)")

As Lumley notes, this is a case in which the sampling weights — bubble size — is not particularly informative, as we don’t see any particularly obvious pattern relating the size of the bubbles to the two variables of interest. Turning to the shading plot, though we can now see more than just the “blob” in the first plot, there’s still no apparent pattern between the bubble size and the two variables to identify.

Below, I include Lumley’s “subsample” method, which draws a simple random sample from the target population by using the provided sampling probabilities. Lumley reccomends looking at 2 or three iterations of a subsample plot in order to ensure that any features visualized are not “noise” from the sampling process.

svyplot(BPXSAR ~ BPXDAR, design = nhanes, style = "subsample", 
        sample.size = 1000,
        xlab = "Diastolic pressure (mmHg)",
        ylab = "Systolic pressure (mmHg)")

#### Aggregation and smoothing

While less of an issue than it was at the time of Lumley’s writing, aggregating and smoothing across points makes it easier to condense the number of sample points in a visualization to reduce the memory required to visualize all the points in the sample. In a design setting it also provides an opportunity to incorporate the design information into a visualization specific estimate.

Lumley’s first example of this is a hexplot - a plot where a grid of hexagons is created and sized according to the number of points within the hex-bin.

svyplot(BPXSAR ~ BPXDAR, design = nhanes, style = "hex", legend = 0, 
        xlab = "Diastolic pressure (mmHg)",
        ylab = "Systolic pressure (mmHg)")

This is perhaps the best visualization of the data thus far, where the “blob” is still apparent but outliers are still visible and a more concise summary of the data is easily visible.

Scatterplot smoothers

Lumley next describes how to take the discrete estimates examining blood pressure as a function of age and smooth them when both variables are continuous. The code below shows how to do this by using the svyplot() function using quantile regression methods from the quantreg package. Mean estimates can also be obtained by using method = 'locpoly'.

adults <- subset(nhanes, !is.na(BPXSAR))
a25 <- svysmooth(BPXSAR ~ RIDAGEYR, method = "quantreg", design = adults,
                 quantile = .25, df = 4)
a50<- svysmooth(BPXSAR ~ RIDAGEYR, method = "quantreg", design = adults,
                 quantile = .5, df = 4)
a75<- svysmooth(BPXSAR ~ RIDAGEYR, method = "quantreg", design = adults,
                 quantile = .75, df = 4)
a10<- svysmooth(BPXSAR ~ RIDAGEYR, method = "quantreg", design = adults,
                 quantile = .1, df = 4)
a90<- svysmooth(BPXSAR ~ RIDAGEYR, method = "quantreg", design = adults,
                 quantile = .9, df = 4)
plot(BPXSAR ~ RIDAGEYR, data = nhanes, type = "n", ylim = c(80, 200),
     xlab = "Age", ylab = "Systolic Pressure")
lines(a50, lwd = 3)
lines(a25, lwd = 1)
lines(a75, lwd = 1)
lines(a10, lty = 3)
lines(a90, lty = 3)
legend("topleft", legend = c("10%,90%", "25%, 75%", "median"),
       lwd = c(1, 1, 3), lty = c(3, 1, 1), bty = "n")

Conditioning Plots

A conditioning plot is effectively a scatter plot with a third variable fixed. This third variable is then displayed in the facet title of the plot. In Lumley’s text he shows how to do this using a call to the svycoplot() function to recreate the top half of Figure 4.20. Of note, the additional hexscale argument can be fed the "absolute" argument to make the scales comparable between panels - by default the scales are facet specific, though the data (for a continuous facet) has a ~50% overlap so the scales are not dramatically different.

svycoplot(BPXSAR ~ BPXDAR | equal.count(RIDAGEMN), 
          style = "hex",  # or "transparent" for shaded hexs,
          # hexscale = "absolute" # for fixed scales across facets.
          design = subset(nhanes, BPXDAR > 0), xbins = 20,
          strip = strip.custom(var.name = "AGE"),
          xlab = "Diastolic pressure (mm Hg)",
          ylab = "Systolic pressure (mm Hg)"
)

Maps

Design and Estimation Issues

The final section in this chapter looks at how to visualize survey data spatially across a map. Since many surveys contain geographic information in both their sampling design and the questions they seek to answer, it makes sense that one might want to visualize estimates at some geographic scale.

Lumley uses the maptools R package to take estimates computed using techniques/functions already demonstrated and visualize them on maps.

Since maptools isn’t available on CRAN at the current time of writing, I’ll use the sf package and ggplot2.

Below I reproduce Figure 4.21 from the text using these packages with the same design based estimates in the text. The data was procured from Lumley’s website — search for “BRFSS”.

states <- read_sf("Data/BRFSS2007/brfss_state_2007_download.shp") %>% 
  arrange(ST_FIPS)
bys <- svyby(~X_FRTSERV, ~X_STATE, svymean,
             design = subset(brfss, X_FRTSERV < 99999))
state_fruit_servings <- states %>% 
  select(ST_FIPS) %>% 
  st_drop_geometry() %>% 
  left_join(bys, by = c("ST_FIPS" = "X_STATE")) %>% 
  mutate(geometry = states$geometry) %>% 
  st_as_sf()
  
state_fruit_servings %>% 
  ggplot() + 
  geom_sf(aes(fill = X_FRTSERV / 100)) +
  theme_void() +
  theme(legend.title = element_blank()) +
  ggtitle("Servings of fruit per day, from BRFSS 2007")

hlth <- brfss %>% 
  as_survey_design() %>% 
  mutate(agegp = cut(AGE, c(0, 35, 50, 65, Inf)),
         state = X_STATE,
         covered = (HLTHPLAN == 1) * 1 ) %>% 
  group_by(agegp, state) %>% 
  summarize(
    health_coverage = survey_mean(covered)
  ) %>% 
  ## Formatting
  mutate(Age = case_when(agegp == "(0,35]" ~ "<35",
                         agegp == "(35,50]" ~ "35-50",
                         agegp == "(50,65]" ~ "50-65",
                         agegp == "(65,Inf]" ~ "65+"))



insurance_coverage <- states %>% 
  select(ST_FIPS,geometry) %>% 
  left_join(hlth, by = c("ST_FIPS" = "state")) %>% 
  st_as_sf()

insurance_coverage %>% 
  ggplot(aes(fill = health_coverage)) +
  geom_sf() +
  facet_wrap(~Age) +
  theme_void() +
  theme(legend.title = element_blank()) 

Lumley then shows two plots with insurance coverage at the state level. Looking at the code he posted on the web it isn’t clear to me what he’s estimating that’s different here as I don’t see any calls to any of the survey functions. Consequently, I’ve just created some fake simulated data and created a plot with the equivalent data.

cities <- read_sf("Data/BRFSS2007/BRFSS_MMSA_2007.shp") %>% 
  filter(NAME != "Columbus") %>%
  transmute(Insurance = cut(rbeta(n(),1,1),c(0,0.25,.5,.75,1)))

marginal_insurance <- brfss %>%
  as_survey_design() %>%
  mutate(
    covered = (HLTHPLAN == 1) * 1,
    state = X_STATE,
    ) %>%
  group_by(state) %>%
  summarize(
    health_coverage = survey_mean(covered)
  ) %>% ungroup()

map_data <- states %>%
  select(ST_FIPS, geometry) %>%
  left_join(marginal_insurance, by = c("ST_FIPS" = "state")) %>%
  st_as_sf()

ggplot() +
  geom_sf(data = map_data, aes(fill = health_coverage)) +
  geom_sf(data = cities, aes(color = Insurance)) +
  theme_void() +
  theme(legend.title = element_blank()) +
  ggtitle("Insurance Coverage - from BRFSS and Fake Data")

Questions

  1. Draw boxplots of body mass index by race/ethnicity and by sex using the CHIS 2005 data introduced in Chapter 2.
svyboxplot(bmi_p ~ racehpr, chis, main = "BMI By Race/Ethnicity")

svyboxplot(bmi_p ~ srsex, chis, main = "BMI BY Sex")

  1. Using the code in Figure 3.8 draw a barplot of the quantiles of income and compare it to the dotchart in Figure 3.9. What are some advantages and disadvantages of each display.

  2. Use svysmooth() to draw a graph showing change in systolic and diastolic blood pressure over time in the NHANES 2003-2004 data. Can you see the change to isolated systolic hypertension in old age that is shown in Figure 4.5.

plot(svysmooth(BPXSAR ~ RIDAGEYR, nhanes))

  1. With the data from the SIPP 1996 panel draw the cumulative distribution function densityfunction, a histogram and a boxplot of total household income. Compare these graphs for their usefulness in showing the distribution of income.

  2. With the data from the SIPP 1996 panel draw a graph showing amount of rent (tmthrent) and proportion of income paid to rent. You will want to exclude some outlying households that report much higher rent than income.

  3. Using data from CHIS 2005 (see section 2.3.1) examine how body mass index varies with age as we did with blood pressure in this chapter.

plot(svysmooth(bmi_p ~ srage_p, chis), 
     xlab = "Age", ylab = "BMI",
     main = "BMI vs. Age (CHIS data)")

We see a very distinctive U-shape - with middle aged indviduals - 50 to 60 year olds - having the highest average BMI, but the young and old having lower BMI’s.

  1. The left-hand panel of Figure 3.3 shows an interesting two-lobed pattern. Can you find what makes these lobes?

  2. Set up a survey design for the BRFSS 2007 data as in Figure 3.7. BRFSS measured annual income (income2) in categories < $10K, $10-15k $20-25k, $25-35k, $35-50k, $50-75k, > $75k and race (orace) :white,black, asian, native hawaiian/pacific island, native american, other.

  • Draw a graph of income by race.
  • Draw maps showing the geographical distribution of income and of racial groups.
  • Draw a set of maps examining whether the geographical distribution of income differs by race.
  1. Explore the impact on the graphs in Figure 4.18 of changes in the amount of smoothing, by altering the df argument to the code in Figure 4.19

Chapter 5: Ratios and Linear Regression

Lumley identifies two main uses for regression in analyzing complex surveys:

  1. Identifying relationships between variables — similar to any other data.
  2. More accurate estimates of population means and totals.

In contrast to model based inference which typically discusses linear regression in the context of distributional assumptions on the outcome variable, Lumley notes that his discussion of regression is still within the design-based philosophy and consequently no model based assumptions are needed in order to compute valid 95% confidence intervals. However, model choice still matters insofar as one’s goal is to precisely estimate a relationship or population mean or total.

In statistical parlance, Lumley is using GEE - or generalized estimating equations to fit models, which only require assumptions about the moments of the underlying data distribution, rather than the family of the distribution.

My take is that this approach also allows for more flexibility in how the variance of the model can be estimated. Since the variance associated with a complex design is a function of the design itself, GEE makes it easier to structure the estimating equations in such a way that the variance computation corresponds to the design. In fact, Lumley explicitly notes in section 5.2.1 that the heteroskedastic / sandwhich based variance estimators which came out of GEE are used by the survey package, with special handling for combining variance within/across strata.

Ratio Estimation

Ratio estimation comes up first in this chapter because it is important in estimating (1) a population mean/total, (2) a ratio directly or (3) a subpopulation estimate of a mean.

Lumley illustrates how to estimate ratios in design methods by using the api dataset and estimating the proportion of students who took the Academic Performance Index exam.

svyratio(~api.stu, ~enroll, strat_design)
## Ratio estimator: svyratio.survey.design2(~api.stu, ~enroll, strat_design)
## Ratios=
##            enroll
## api.stu 0.8369569
## SEs=
##              enroll
## api.stu 0.007757103

This estimate does a good job of estimating the true population total, 0.84, which we happen to have access to for this example.

It’s worth noting here as Lumley does that ratio estimates are not unbiased but are classified as “approximately unbiased” since the bias decreases proportional to the sample size and is consequently smaller than the standard error – which decreases proportional to \(\frac{1}{\sqrt{n}}\).

Ratios for subpopulation estimates

In the case where individual - not aggregate - data are available the ratio being estimated is simply a proportion. This is the same logic for which subpopulation estimates had been calculated previously in [Chapter 2:Simple Random Sample] via the svyby() function. These estimates require special handling - though in the survey package they can all be calculated via svymean(), svyratio() and svycontrast() which I show below using the same designe object from Chapter 2.

It is worth noting before doing so however, that the special handling needed here follows from the fact that both the numerator and the denominator are estimated. Lumley delves into this in the appendix which I’ll reproduce here alongside questions I have that I’ll look to return to in the future.

A brief aside on ratio variance estimation

We’ll define the subpopulation estimate of interest using the indicator function \(Y_i = X_iI(Z_I > 0)\), where \(I(Z_i > 0) = 1\) for members of the subpopulation and 0 otherwise, and \(X_i\) is the measurement of interest.

The variance estimate using replicate weights can be calculated similar to the typical variance estimate, those replicate weights belonging to sample observations outside the subpopulation are simply not used - this again highlights the utility of replicate weights.

For linearization - that is using a taylor series to estimate the variance - the value becomes more complicated, following Lumley’s appendix the HTE is defined as:

\[ \hat{V}[\hat{T}_Y] = \sum_{i,j} \left(\frac{Y_i Y_j}{\pi_ij} - \frac{Y_i}{\pi_i} \frac{Y_j}{\pi_j} \right) \\ = \sum_{Z_i,Z_j > 0} \left(\frac{Y_i Y_j}{\pi_ij} - \frac{Y_i}{\pi_i} \frac{Y_j}{\pi_j} \right). \]

Here however, Lumley states

but the simplified computational formulas for special designs are not the same.

which I don’t completely understand. I suppose he means for clustered or multiphase designs things but it isn’t clear as he goes onto say

for example, the formula for the variance of a total under simple random sampling (equation 2.2)

\[ V[\hat{T}_X] = \frac{N-n}{N} \times N^2 \times \frac{V[X]}{n} \]

cannot be replaced by

\[ V[\hat{T}_Y] \stackrel{?}{=} \frac{N-n}{N} \times N^2 \times \frac{V[X]}{n} \]

or even, defining \(n_D\) as the number sampled in the subpopulation, by

\[ \stackrel{?}{=} \frac{N-n_D}{N} \times N^2 \times \frac{V[X]}{n_D} \]

In order to use these simplified formulas it is necessary to work with the variable \(Y\) and use

\[ V[\hat{T}_Y] = \frac{N-n}{N} \times N^2 \times \frac{V[Y]}{n} \]

Its not clear in this last expression if we’re simply back to the initial expression that couldn’t be used, or if we’re using the smaller sample subset again for variance computations but Lumley’s next text suggests that’s the case:

Operationally, this means that variance estimation in a subset of a survey design object in R needs to involve the \(n - n_D\) zero contributions to an estimation equation.

I hope to shed more light on what’s going on here in the future but for now its clear why this is in the appendix, but not exactly clear to me why observations outside the subpopulation are simply zero’d in variance computation.

Lumley uses the following three function calls to illustrate three different ways to estimate a ratio (1) A call to svymean(), (2) A call to svyratio() and (3) `

svymean(~bmi_p, subset(chis, srsex == "MALE" & racehpr == "AFRICAN AMERICAN"))
##         mean     SE
## bmi_p 28.019 0.2663
chis <- update(chis,
  is_aamale = (srsex == "MALE" & racehpr == "AFRICAN AMERICAN")
)
svyratio(~ I(is_aamale * bmi_p), ~is_aamale, chis)
## Ratio estimator: svyratio.svyrep.design(~I(is_aamale * bmi_p), ~is_aamale, chis)
## Ratios=
##                      is_aamale
## I(is_aamale * bmi_p)  28.01857
## SEs=
##          [,1]
## [1,] 0.266306
totals <- svytotal(~ I(bmi_p * is_aamale) + is_aamale, chis)
totals
##                         total       SE
## I(bmi_p * is_aamale) 19348927 260401.7
## is_aamaleFALSE       25697039   6432.3
## is_aamaleTRUE          690575   6341.9
svycontrast(
  totals,
  quote(`I(bmi_p * is_aamale)` / `is_aamaleTRUE`)
)
##           nlcon     SE
## contrast 28.019 0.2663

Ratio estimators of totals

The third use Lumley lists for the use of ratio estimators is to construct more accurate estimates of population means or totals. His motivating example is to take the ratio estimate of individuals who took the API tests and then use that to determine the approximate number of students who took the test, by multiplying the ratio estimate by the number of students. This can be done by hand or via the survey package predict() function used in conjunction with svyratio().

r <- svyratio(~api.stu, ~enroll, strat_design)
predict(r, total = sum(apipop$enroll, na.rm = TRUE))
## $total
##          enroll
## api.stu 3190038
## 
## $se
##           enroll
## api.stu 29565.98

Lumley uses this as a jumping off point to discuss linear regression, since,

one can imagine the relationship between the number of students taking the API test as being roughly proportional to the number of students enrolled at the schools; \(E[tests_i] = \alpha \times \text{enrollment}_i + \epsilon_i\) where \(E[\epsilon] = 0\) (note that this is an assumption about the moment of the error distribution and not the shape itself).

Linear Regression

A quick review of the moment assumptions for linear regression - which is all that are needed for designed based estimation.

If we have random response variable \(Y\) and explanatory variable \(X\) then linear regression is looking at the relationship between the expectation of \(Y\) and \(X\): \[ E[Y] = \alpha + X \beta, \]

Where \(\alpha\) is a constant offset, also called the intercept and \(\beta\) is the slope coefficient that describes the change in \(E[Y]\) per unit change in \(X\). Here I’m referring to \(X\) and \(\beta\) as singular variables but they could also be a matrix and vector of explanatory variables and slope coefficients, respectively. The variance of the response variable, \(Y\) is assumed to be constant, i.e. \(V[Y] = \sigma^2\), unless otherwise modeled.

Following the standard OLS estimation procedure, we’d normally minimize the squared error and Lumley notes that if we have the complete population data, we’d be finished at that:

\[ RSS = \sum_{i=1}^{N} (Y_i - \alpha X_i\beta)^{2}. \]

Given that we’re typically dealing with (complex) samples though, we have to adjust the estimates to account for the weighting:

\[ \hat{RSS} = \sum_{i=1}^{n} \frac{1}{\pi_i}(Y_i - \alpha - X_i \beta)^2, \]

so each error term is up weighted according to its sampling probability.

Regression Estimation of Population Totals

Lumley’s ratio estimator of a population total described previously derives from the linear regression model with a single predictor and no intercept. The separate ratio estimator, similar to the cell means model estimates a ratio for each stratum and the estimate of the total is the sum of the denominator for all strata. Formally,

\[ E[Y_i] = \beta_k \times X_i \times \{i \in \text{ stratum } k\}, \]

where \(\beta_k\) is the ratio for stratum \(k\) estimates the given quantity. This setup can provide more precise estimates than the single ratio estimator when the sample size gets large and the strata are able to better explain the outcome variable. However, if the strata don’t have any correlation with the outcome then the standard errors increase, due to the need to estimate the extra parameters.

THIS EXAMPLE DOESN’T MAKE COMPLETE SENSE. Lumley illustrates this latter phenomenon with the California school data - using the percentage of english language learners as a predictor of the overall number of students taking the api tests.

sep <- svyratio(~api.stu, ~enroll, strat_design, separate = TRUE)
com <- svyratio(~api.stu, ~enroll, strat_design)
stratum_totals <- list(E = 1877350, H = 1013824, M = 920298)
predict(sep, total = stratum_totals)
## $total
##          enroll
## api.stu 3190022
## 
## $se
##           enroll
## api.stu 29756.44
predict(com, total = sum(unlist(stratum_totals)))
## $total
##          enroll
## api.stu 3190038
## 
## $se
##           enroll
## api.stu 29565.98

We see the common ratio has a smaller standard error than the separate ratio estimator.

svyby(~api.stu, ~stype, design = dstrata, denom = ~enroll, svyratio)
##   stype api.stu/enroll se.api.stu/enroll
## E     E      0.8558562       0.006034685
## H     H      0.7543378       0.031470156
## M     M      0.8331047       0.017694634

Incomes in Scotland

Lumley works through an example looking at household incomes across Scotland from their national household survey. Unfortunately both the dataset subset he provides at his website to and the full dataset he links to don’t have the variables he uses in his example code in Figure 5.7. For example the code he provides is filtered using an ADULTH variable which isn’t found in either dataset.

load("Data/SHS/shs.rda") # Lumley's website data
colnames(shs$variables)
##  [1] "psu"      "uniqid"   "ind_wt"   "shs_6cla" "council"  "rc5"     
##  [7] "rc7e"     "rc7g"     "intuse"   "groupinc" "clust"    "stratum" 
## [13] "age"      "sex"      "emp_sta"  "grosswt"  "groc"
load("Data/SHS/ex2.RData") # PEAS "full Data" website
colnames(shs)
##  [1] "PSU"      "UNIQID"   "IND_WT"   "SHS_6CLA" "COUNCIL"  "RC5"     
##  [7] "RC7E"     "RC7G"     "INTUSE"   "GROUPINC" "CLUST"    "STRATUM" 
## [13] "AGE"      "SEX"      "EMP_STA"  "GROSSWT"  "GROC"

Consequently, I won’t reproduce this example here, except to say that he uses the example to illustrate how oversampling certain strata (poorer households) can improve the precision associated with the household income estimate and that using the population information increased the precision of the weekly household income estimate via linear regression – for those sub-populations for which population information is available.

US Elections

Similarly it doesn’t look like the dataset included in the current survey R package has data for the 2008 election - I only see Bush / McCain vote totals.

data(elections)
colnames(election)
## [1] "County"             "TotPrecincts"       "PrecinctsReporting"
## [4] "Bush"               "Kerry"              "Nader"             
## [7] "votes"              "p"

Consequently I can’t do the analysis he shows predicting 2008 votes using 2000 votes. It’ll hopefully suffice to say that in theme with the content for the chapter, that because these values are correlated — 2000 vote % for republican candidate and 2008 vote % for republican candidate — it stands to reason that we can reduce the variance of the resulting estimate rather than using the 2008 data alone.

Confouding and other criteria for model choice

Lumley describes three categories for describing why a predictor might be included in a regression, noting that this may help the model fit better and thus aid in reducing bias from a probability sample that results from, say non-response.

  1. Exposure of interest: If we’re interested in a specific variable’s impact on a variable, it makes sense to include that in a model to estimate the relationship.

  2. Confounding variables: A variable may not be of primary interest, but may be associated with both the outcome variable and exposure of interest. Consequently, this will need to be adjusted for in order to isolate the effect of interest.

  3. Precision variables: These are, again, associated with the outcome variable of interest, but not associated with the exposure of interest. However, because of their association alone, they can increase the precision with which the exposure effect is estimated.

Lumley goes on to describe methods for model selection which I’ll leave for the text.

Linear models in the survey package

Example:Dietary sodium and potassium and blood pressure

demo <- haven::read_xpt("data/nhanesxpt/demo_c.xpt")[, c(1:8, 28:31)]
bp <- haven::read_xpt("data/nhanesxpt/bpx_c.xpt")
bm <- haven::read_xpt("data/nhanesxpt/bmx_c.xpt")[, c("SEQN", "BMXBMI")]
diet <- haven::read_xpt("data/nhanesxpt/dr1tot_c.xpt")[, c(1:52, 63,64)]
nhanes34 <- merge(demo, bp, by = "SEQN")
nhanes34 <- merge(nhanes34, bm, by = "SEQN")
nhanes34 <- merge(nhanes34, diet, by = "SEQN")

demo5 <- haven::read_xpt("data/nhanesxpt/demo_d.xpt")[, c(1:8,39:42)]
bp5 <- haven::read_xpt("data/nhanesxpt/bpx_x.xpt")

bp5$BPXSAR <- rowMeans(bp5[,c("BPXSY1","BPXSY2","BPXSY3","BPXSY4")], 
                       na.rm = TRUE)
bp5$BPXDAR <-  rowMeans(bp5[,c("BPXDI1","BPXDI2","BPXDI3","BPXDI4")], 
                       na.rm = TRUE)
bm5 <- haven::read_xpt("data/nhanesxpt/bmx_d.xpt")[, c("SEQN", "BMXBMI")]
diet5 <- haven::read_xpt("data/nhanesxpt/dr1tot_d.xpt")[, c(1:52,64,65)]


nhanes56 <- merge(demo5, bp5, by = "SEQN")
nhanes56 <- merge(nhanes56, bm5, by = "SEQN")
nhanes56 <- merge(nhanes56, diet5, by = "SEQN")

nhanes <- rbind(nhanes34, nhanes56)
nhanes$fouryearwt <- nhanes$WTDRD1 / 2
# I added the two lines below to make graphing the
# smooth plot easier
nhanes$sodium <- nhanes$DR1TSODI / 1000 
nhanes$potassium <- nhanes$DR1TPOTA / 1000 

des <- svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~fouryearwt,
                 nest = TRUE, data = subset(nhanes, !is.na(WTDRD1)))

des <- update(des, sodium = DR1TSODI / 1000, potassium = DR1TPOTA / 1000)
des
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (60) clusters.
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)

Lumley uses the following plot — examining systolic blood pressure as a function of daily sodium intake — to motivate the need to adjust for confounders. As we can see below in the reproduced Figure 5.10, there doesn’t appear to be much a relationship between sodium intake and average blood pressure. Lumley argues that we observe this simpler relationship because of the association between sodium and blood pressure is confounded by age.

plot(BPXSAR ~ sodium, data = nhanes, type = "n")

points(svyplot(BPXSAR ~ sodium, design = des, style = "transparent", xlab = 
          "Dietary Sodium (g/day)", ylab = "Systolic Blood Pressure (mm Hg)"))
lines((svysmooth(BPXSAR ~ sodium, des)))

To test this hypothesis, Lumley first visualizes the three variables using the conditional plot demonstrated previously.

svycoplot(BPXSAR ~ sodium | equal.count(RIDAGEYR), des,
          style = "hexbin",
          xlab = "Dietary Sodium (g/day)",
          ylab = "Systolic BP (mmHg)",
          strip = strip.custom(var.name = "Age"))

In the above plot we see a greater indication that as dietary sodium increases , so too does systolic blood pressure.

To more formally test the hypothesis, Lumley fits several models with these variables included. The first two just include (1) sodium and potassium and (2) sodium, potassium and Age.

model0 <- svyglm(BPXSAR ~ sodium + potassium, design = des)
summary(model0)
## 
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium, design = des)
## 
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120.3899     0.7105 169.436  < 2e-16 ***
## sodium       -0.6907     0.1658  -4.166 0.000268 ***
## potassium     0.7750     0.2655   2.919 0.006853 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 382.6159)
## 
## Number of Fisher Scoring iterations: 2
model1 <- svyglm(BPXSAR ~ sodium + potassium + RIDAGEYR, design = des)
summary(model1)
## 
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR, design = des)
## 
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.73535    0.79568 125.346  < 2e-16 ***
## sodium       0.79846    0.14866   5.371 1.13e-05 ***
## potassium   -0.91148    0.18994  -4.799 5.23e-05 ***
## RIDAGEYR     0.49561    0.01169  42.404  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 275.4651)
## 
## Number of Fisher Scoring iterations: 2

As we can see, the sodium and potassium coefficient signs in the first model switch direction once age is included in the second model, demonstrating that the two variables are associated with both age and systolic blood pressure. The second model makes more sense intuitively, because we expect systolic blood pressure to increase, on average, as a function of sodium intake.

Lumley adds a few more possible confounders to model2 and then tests to see whether the effects of daily dietary sodium and potassium on systolic blood pressure are significantly different than zero using the regTermTest() function.

model2 <- svyglm(BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + BMXBMI,
                 design = des)
summary(model2)
## 
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + 
##     BMXBMI, design = des)
## 
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.32903    1.42668  68.220  < 2e-16 ***
## sodium       0.43458    0.16164   2.689   0.0126 *  
## potassium   -0.96119    0.17043  -5.640 7.19e-06 ***
## RIDAGEYR     0.45791    0.01080  42.380  < 2e-16 ***
## RIAGENDR    -3.38208    0.38403  -8.807 3.90e-09 ***
## BMXBMI       0.38460    0.03797  10.129 2.48e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 263.5461)
## 
## Number of Fisher Scoring iterations: 2
regTermTest(model2, ~potassium + sodium, df = NULL)
## Wald test for potassium sodium
##  in svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + 
##     BMXBMI, design = des)
## F =  15.98481  on  2  and  25  df: p= 3.3784e-05

The test is formally examining whether a model with these terms is more likely, given the data, then one without, with the null hypothesis assuming that the extra terms are unneccessary. As we can see, the model is very unlikely to fit so well with the two extra terms, so there’s evidence to support the association between the terms and blood pressure.

Lumley digs into further details of why the effect is so small — 1 gram of sodium (2.5 grams of salt) is a lot of salt required to increase systolic blood presssure “only” .43 mmHg. Some explanations include measurement error, missing data, and model misspecification. Lumley examines the last of these by displaying the model diagnostic plots. Model misspecification can be examined by identifying any association between the partial residuals and the observed sodium intake. I replicate Lumley’s code below.

par(mfrow = c(1, 2))
plot(as.vector(predict(model1)), resid(model1), 
     xlab = "Fitted Values", ylab = "Residuals")
nonmissing <- des[-model1$na.action]
plot(nonmissing$variables$sodium,
     resid(model1, "partial")[,1],
     xlab = "Sodium", 
     ylab = "Partial Residuals")

nonmissing <- des[-model1$na.action]
par(mfrow = c(1, 2))
plot(model1, panel = make.panel.svysmooth(nonmissing))

termplot(model1, data = model.frame(nonmissing),
         partial = TRUE, se = TRUE, smooth = make.panel.svysmooth(nonmissing))

int1 <- svyglm(BPXSAR ~ (sodium+potassium)*I(RIDAGEYR-40) + RIAGENDR + BMXBMI,
               design = des)
summary(int1)
## 
## Call:
## svyglm(formula = BPXSAR ~ (sodium + potassium) * I(RIDAGEYR - 
##     40) + RIAGENDR + BMXBMI, design = des)
## 
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                116.215191   1.240298  93.699  < 2e-16 ***
## sodium                       0.309581   0.165746   1.868 0.074583 .  
## potassium                   -0.975994   0.182598  -5.345 1.99e-05 ***
## I(RIDAGEYR - 40)             0.605278   0.022047  27.455  < 2e-16 ***
## RIAGENDR                    -3.516219   0.377502  -9.314 2.87e-09 ***
## BMXBMI                       0.387957   0.038347  10.117 6.14e-10 ***
## sodium:I(RIDAGEYR - 40)     -0.015707   0.008767  -1.792 0.086356 .  
## potassium:I(RIDAGEYR - 40)  -0.039575   0.010121  -3.910 0.000703 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 261.658)
## 
## Number of Fisher Scoring iterations: 2

Is Weighting Needed in a Regression Model ?

Lumley caps off this chapter asking whether we even need to bother with the specialized weighting built into the survey package. His answer is worth digging into:

Since regression models use adjustment for confounders as a way of removing distorted associations between exposure and response, it is plausible that a regression model might not need sampling weights.

A key assumption here is whether the population we’re estimating is stable across the population(s) represented by our data set. The naive, biased population that our sample represents when unweighted, or the “target” population our sample represents when re-weighted. A follow-up question would ask whether we could even estimate the relationship as desired, if the effect is heterogeneous across populations. Lumley cites (DuMouchel and Duncan 1983) for further discussion on this topic.

I have more to say here — Thinking of this article [gelman2007struggles] —, but it may not fit in these notes. For now I’ll end with Lumley’s two limitations for regression models when not using weights:

  1. Some important variables used in constructing the weights may not be available,

  2. Further, the important variables mentioned above may not be suitable for including in the model.

Lumley urges caution in this regard, advising that even a small amount of bias introduced from not including the weights may make any potential increase in precision that comes from not using the weights as poor trade-off.

Questions

  1. This exercise uses the WA State crime data for 2004 as the population. The data consists of crime rates and population sizes for the police districts (in cities/towns) and shefiffs’ offices (in unincorporated areas), grouped by county.
  • Take a simple random sample of ten counties from the state and use all the data from the sampled counties. Estimate the total number of murders and burglaries in the state.
county_sample <- wa_crime_df %>% 
  distinct(County) %>% 
  slice_sample(n = 10) %>% 
  pull(County)

wa_crime_df %>% 
  filter(County %in% county_sample) %>% 
  as_survey_design() %>% 
  summarize(
    crime = survey_total(murder_and_crime)
  )
## # A tibble: 1 × 2
##   crime crime_se
##   <dbl>    <dbl>
## 1  8644    2238.
  • Use the population of each county as an auxiliary variable to estimate the totals.
ratio_estimate <- wa_crime_df %>% 
  filter(County %in% county_sample) %>% 
  as_survey_design() %>% 
  svyratio(~murder_and_crime, ~Population, design = .)

predict(ratio_estimate, total = sum(wa_crime_df$Population)) 
## $total
##                  Population
## murder_and_crime   56678.22
## 
## $se
##                  Population
## murder_and_crime   5888.009
  • Use the numbers of murders and burglaries in the previous year as auxiliary variables in a regression estimate of the totals (why can’t we use a ratio estimate here?)
wa_crime_03_df <- readxl::read_xlsx("data/WA_crime/1984-2011.xlsx", skip = 4) %>%
  filter(Year == "2003", Population > 0) %>%
  mutate(
    murder_and_crime = `Murder Total` + `Burglary Total`,
    state_pop = sum(Population),
    County = stringr::str_to_lower(County),
    num_counties = n_distinct(County),
  ) %>%
  group_by(County) %>%
  mutate(num_agencies = n_distinct(Agency)) %>%
  ungroup() %>%
  select(
    County, Agency, Population, murder_and_crime,
    num_counties, num_agencies
  )

model_df <- wa_crime_03_df %>% 
  mutate(year = "2003") %>% 
  bind_rows(wa_crime_df %>% mutate(year = "2004")) %>% 
  mutate(
    # We'll use 2004's numbers here for the fpc.
    num_counties = if_else(year == "2004", num_counties, 0),
    num_agencies = if_else(year == "2004", num_agencies, 0),
  ) %>% 
  spread(year, murder_and_crime) %>% 
  group_by(County,Agency) %>% 
  summarize( 
    num_counties = sum(num_counties),
    num_agencies = sum(num_agencies),
    `2003` = sum(replace_na(`2003`,0)),
    `2004` = sum(replace_na(`2004`,0))
  ) %>% 
  ungroup() %>% 
  filter(
    # Agencies were removed between 2003 and 2004. 
    num_counties > 0, num_agencies > 0,
    County %in% county_sample
    ) 

model_design <- model_df %>%  
  as_survey_design(id = c(County, Agency),
                   fpc = c(num_counties, num_agencies))

fit <- svyglm(`2004` ~ `2003`, design = model_design)

total_matrix <- c(sum(wa_crime_03_df$murder_and_crime))
total_matrix <- as.data.frame(total_matrix)
names(total_matrix) <- "2003"

predict(fit, newdata = total_matrix)
##    link     SE
## 1 52874 1138.2

We can’t use a ratio estimator here because we’re not using the total population of the year in question as the denominator, we’re only looking at the total number of murders and burglaries in the previous year and relating it to the next, without measuring the total population, explicitly.

  • Stratify the sampling so that King County is sampled with 100% probability together with a simple random sample of five other counties. use population as an auxiliary variable to construct a common ratio estimate and a separate ratio estimate of the population totals.
smaller_county_sample <- wa_crime_df %>% 
  distinct(County) %>% 
  filter(County != "king") %>% 
  slice_sample(n = 5) %>% 
  pull(County)

county_list <- unique(wa_crime_df$County)

design <- wa_crime_df %>% 
  filter(County == "king" | County %in% smaller_county_sample) %>% 
  mutate(
    strata_label = if_else(County == "king", "King County", "WA Counties"),
    num_counties = if_else(County == "king", 1, length(county_list) - 1)
  ) %>% 
  as_survey_design(
    ids = c(County, Agency),
    fpc = c(num_counties, num_agencies),
    strata = strata_label
  )

strata_totals <- wa_crime_df %>% 
  mutate(strata = if_else(County == "king","King", "WA Counties")) %>% 
  group_by(strata) %>% 
  summarize(Population = sum(Population)) %>% 
  spread(strata,Population) %>% 
  as.matrix()

separate_estimator <- svyratio(~murder_and_crime, ~Population, design,
                               separate = TRUE)
common_estimator <- svyratio(~murder_and_crime, ~Population, design)
predict(separate_estimator, total = strata_totals)
## $total
##                  Population
## murder_and_crime    68697.7
## 
## $se
##                  Population
## murder_and_crime   2141.227
predict(common_estimator, total = sum(wa_crime_df$Population))
## $total
##                  Population
## murder_and_crime   68915.53
## 
## $se
##                  Population
## murder_and_crime   3370.313
  • Take simple random samples of five police districts from King County and five counties from the rest of the state. use population as an auxiliary variable to construct a common ratio estimate and a separate ratio estimate of the population totals.
king_districts <- wa_crime_df %>%
  filter(County == "king") %>%
  pull(Agency)
sampled_king_districts <- sample(king_districts, 5)
sampled_counties <- sample(county_list, 5)

design <- wa_crime_df %>%
  filter(County %in% sampled_counties | Agency %in% sampled_king_districts) %>%
  mutate(
    strata_label = if_else(County == "king", "King County", "WA Counties"),
    num_counties = if_else(County == "king", 1, length(county_list) - 1),
  ) %>%
  as_survey_design(
    id = c(County, Agency),
    fpc = c(num_counties, num_agencies),
    strata = strata_label
  )


separate_estimator <- svyratio(~murder_and_crime, ~Population, design,
                               separate = TRUE)
common_estimator <- svyratio(~murder_and_crime, ~Population, design)
predict(separate_estimator, total = strata_totals)
## $total
##                  Population
## murder_and_crime   61185.56
## 
## $se
##                  Population
## murder_and_crime   5576.053
predict(common_estimator, total = sum(wa_crime_df$Population))
## $total
##                  Population
## murder_and_crime   61727.29
## 
## $se
##                  Population
## murder_and_crime   8328.097
  1. Using the WA state crime data as a population, take a stratified sample of five police districts from King County and five counties from the rest of the state. Estimate the ratio of violent crimes to non-violent crimes. Compare to the population value.
sampled_king_districts <- sample(king_districts, 5)
sampled_counties <- sample(county_list, 5)

wa_crime_df %>%
  filter(County %in% sampled_counties | Agency %in% sampled_king_districts) %>%
  mutate(
    strata_label = if_else(County == "king", "King County", "WA Counties"),
    num_counties = if_else(County == "king", 1, length(county_list) - 1),
  ) %>%
  as_survey_design(
    id = c(County, Agency),
    fpc = c(num_counties, num_agencies),
    strata = strata_label
  ) %>% 
  summarize(
    violent_non_violent = survey_ratio(violent_crime, property_crime)
  )
## # A tibble: 1 × 2
##   violent_non_violent violent_non_violent_se
##                 <dbl>                  <dbl>
## 1              0.0608                0.00749
round(sum(wa_crime_df$violent_crime) / sum(wa_crime_df$property_crime),2)
## [1] 0.07

We can see that the estimate is quite close to the population value

  1. Using the data from Wave 1 of the 1996 SIPP panel (see Figure 3.8)
  • Estimate the ratio of population totals for monthly rent (tmthrnt) and total household income (thtrninc) over the whole population and over the subpopulation who pay rent.

I think Lumley may have an error here when he says that thtrninc is the total household monthly income - earlier we used thtotinc for this measure as he had increating Figure 3.9. Consequently, I use thtotinc below.

sipp_hh_sub %>% 
  # Total population
  summarize(
    ratio_of_monthly_rent_to_household_income = survey_ratio(tmthrnt, thtotinc)
  )
## # A tibble: 1 × 2
##   ratio_of_monthly_rent_to_household_income ratio_of_monthly_rent_to_household…¹
##                                       <dbl>                                <dbl>
## 1                                   0.00236                            0.0000800
## # ℹ abbreviated name: ¹​ratio_of_monthly_rent_to_household_income_se
sipp_hh_sub %>% 
  filter(tmthrnt > 0) %>% 
  # Rent paying subpopulation
  summarize(
    ratio_of_monthly_rent_to_household_income = survey_ratio(tmthrnt, thtotinc)
  )
## # A tibble: 1 × 2
##   ratio_of_monthly_rent_to_household_income ratio_of_monthly_rent_to_household…¹
##                                       <dbl>                                <dbl>
## 1                                     0.209                              0.00506
## # ℹ abbreviated name: ¹​ratio_of_monthly_rent_to_household_income_se
  • Compute the individual-level ratio, i.e., the proportion of household income paid in rent, and estimate the population mean over the whole population and over the subpopulation who pay rent.
# Full Population
sipp_hh_sub %>% 
  mutate(
    # I also ran the numbers if we excluded those with 0 household rent
    # and the estimates are effectively the same.
    prop_income_rent = if_else(thtotinc == 0, 0, (tmthrnt / thtotinc)),
  ) %>% 
  summarize(
    prop_income_rent_est = survey_mean(prop_income_rent)
  )
## # A tibble: 1 × 2
##   prop_income_rent_est prop_income_rent_est_se
##                  <dbl>                   <dbl>
## 1               0.0221                 0.00624
sipp_hh_sub %>% 
  # Rent paying subpopulation
  filter(tmthrnt > 0) %>% 
  mutate(
    # I also ran the numbers if we excluded those with 0 household rent
    # and the estimates are effectively the same.
    prop_income_rent = if_else(thtotinc == 0, 0, (tmthrnt / thtotinc)) ,
  ) %>% 
  summarize(
    prop_income_rent_est = survey_mean(prop_income_rent)
  )
## # A tibble: 1 × 2
##   prop_income_rent_est prop_income_rent_est_se
##                  <dbl>                   <dbl>
## 1                0.548                   0.154

What are we to make of these estimates being different? Well that’s because they’re estimating two different things. As Lumley points out at the start of the chapter, one is a ratio of two population-level quantities, the other is the population estimate of a ratio measured at the individual level.

  1. Use the stratified sample from the Academic Performance Index population to examine whether the proportion of teachers with only emergency qualifications (emer) affects academic performance (as measured by 2000 API).
  • What confounding variables measuring socioeconomic status of students should be included in the model?

Going off the web documentation of the api dataset from the survey package, it looks like there are a number of possible confounding variables should be included in the model. Here is a list with a brief explanation: * meals: The % of students eligible for subsidized meals. This is a proxy for poverty and is likely correlated with the academic achievement measured by the test.

  • hsg: percent of parents who are high school graduates. Parental academic achievement is likely associated with their students’ academic achievement.

  • avg.ed: Average parental education level - this might be able to combine the above variable along with those who have college or post-graduate education.

  • comp.imp refers to a school “growth” improvement targets that may be related to students’ academic performance.

  • acs.k3 average class size years K-3 - class size is often associated with academic performance There is a similar acs.46 variable for grades 4-6.

  • ell The percent of english language learners. Since most classes are typically instructed in english, a non-native english speakers may struggle more with academic instruction.

  • full percent fully qualified teachers. A fully qualified teacher will presumably be more capable of teaching than one that’s not fully qualified.

  • enroll Enrollment may also be associated with the API, if larger schools have access to greater resources, or inversely, worse teacher to student ratios.

  • Should 1999 API be in the model (and why, or why not?)

    • The value of including the 1999 API score would be that its very likely one of the most correlated variables with the 2000 score. The downside is that it is also likely correlated with all the other measures, including the emer measure, and may mask that variable’s weaker impact. I’ll leave it out in my estimate to try to avoid this problem.
  • Do any of the confounding variables need to be transformed?

    • Several of the binary / categorical variables will need to be transformed into 0 / 1 or cell means encoding. It could also be beneficial to center several of the continuous variables at the mean to offer an easier interpretation of the model.
  • Does emer need to be transformed?

It doesn’t look like it needs to be to me. I include two plots below that visualize the emer distribution (in the target population) as well as its relationship with the api00 measure. While there is a right skew in the distribution, this doesn’t strike me as problematic and I think a log transformation - an attempt to fix the skew - would not be helpful both because of the difficulty in handling 0 values as well as the log-scale interpretation.

It could be worth centering the emer value by the estimated population mean to make the interpretation of the intercept more valueable but I don’t think that’s necessary for an adequate model interpretation here.

svyhist(~emer, strat_design)

svyplot(api00 ~ emer, strat_design, 
        main = "CA 2000 API Scores vs. Teacher Emergency Training Preparedness",
        xlab = "Teachers with emergency training",
        ylab = "API 2000")

  • What is the conclusion at the end?
fit <- svyglm(api00 ~ emer + meals + hsg + avg.ed + comp.imp +
                acs.k3 + acs.46 + ell + full + enroll,
              design = strat_design)
summary(fit)
## 
## Call:
## svyglm(formula = api00 ~ emer + meals + hsg + avg.ed + comp.imp + 
##     acs.k3 + acs.46 + ell + full + enroll, design = strat_design)
## 
## Survey design:
## svydesign(id = ~1, strata = ~stype, fpc = ~fpc, data = apistrat)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 652.826459 163.333076   3.997 0.000137 ***
## emer         -0.726831   1.485803  -0.489 0.625986    
## meals        -2.378649   0.483856  -4.916 4.32e-06 ***
## hsg           0.424671   0.416138   1.021 0.310419    
## avg.ed       48.006215  17.485877   2.745 0.007390 ** 
## comp.impYes  15.075954  14.397161   1.047 0.298035    
## acs.k3       -3.366084   4.984776  -0.675 0.501357    
## acs.46        2.989147   1.389796   2.151 0.034367 *  
## ell          -0.228954   0.407923  -0.561 0.576110    
## full          0.006335   1.242588   0.005 0.995944    
## enroll       -0.042131   0.035810  -1.176 0.242720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3905.969)
## 
## Number of Fisher Scoring iterations: 2

From the output above, we see that the emer value isn’t found to be significantly associated (at \(\alpha = 0.05\)) with the 2000 API measure after adjusting for other variables. Indeed, meals, avg.ed and acs.46 are the only values for which there is evidence to support a relationship at this level.

  1. Following on from the previous exercise, fit the same model to the whole population (the data set apipop) using the glm() function.
fit_pop <- glm(api00 ~  emer + meals + hsg + avg.ed + comp.imp +
                acs.k3 + acs.46 + ell + full + enroll, data = apipop)
summary(fit_pop)
## 
## Call:
## glm(formula = api00 ~ emer + meals + hsg + avg.ed + comp.imp + 
##     acs.k3 + acs.46 + ell + full + enroll, data = apipop)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 460.895349  21.325334  21.613  < 2e-16 ***
## emer          0.542863   0.171137   3.172  0.00152 ** 
## meals        -2.180465   0.058205 -37.462  < 2e-16 ***
## hsg           0.209642   0.078151   2.683  0.00734 ** 
## avg.ed       50.288082   2.345033  21.445  < 2e-16 ***
## comp.impYes  30.659677   2.103305  14.577  < 2e-16 ***
## acs.k3        0.816373   0.556165   1.468  0.14222    
## acs.46        0.863564   0.268385   3.218  0.00130 ** 
## ell          -0.488302   0.064119  -7.616 3.25e-14 ***
## full          1.560638   0.152078  10.262  < 2e-16 ***
## enroll       -0.035925   0.004889  -7.348 2.43e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2748.517)
## 
##     Null deviance: 70739140  on 4070  degrees of freedom
## Residual deviance: 11158979  on 4060  degrees of freedom
##   (2123 observations deleted due to missingness)
## AIC: 43803
## 
## Number of Fisher Scoring iterations: 2
  • Do the sample estimates agree with the population data? Do your decisions about transforming variables hold up in the population data?

Some of the sample estimates agree with the population data. It is very clear that there is a whole lot more power to detect associations when the entire population is present. In brief, we see that the three variables for which an association was detected previously — meals, acs.46 and avg.edu — all have similar (within the sample based estimate margin of error) estimates on the full population data. Additionally, many other variables now have significant associations that did not previously. Notably the emer variable has a positive association with the api00 such that we’d expect to see a .5 gain in API score for every additional percent gain in teachers that are emergency qualified.

  • Fit the same model to 100 stratified samples from the population. Is the sampling distribution of the coefficients close to a Normal distribution?
OneSimulation <- function() {
  coefs <- apipop %>% 
    group_by(stype) %>% 
    slice_sample(n = 66) %>% 
    ungroup() %>% 
    as_survey_design(
      strata = stype
    ) %>% 
    svyglm(api00 ~  emer + meals + hsg + avg.ed + comp.imp +
                acs.k3 + acs.46 + ell + full + enroll, design = .) %>% 
    coef(.)
  return(coefs)
}
coef_dist <- replicate(100, OneSimulation())
par(mfrow = c(1,2))
hist(coef_dist[1,], main = "Histogram of Intercept")
hist(coef_dist[2,], main = "Histogram of emer")

Yes, as we’d expect the sampling distributions are roughly normal.

  1. Using the blood pressure data from NHANES 2003 - 2006, investigate the effect of obesity on blood pressure using the Body Mass Index and blood pressure data.
  • What variables in the data set are potential confounders?

Given the discussion in the book’s example, sodium and potassium intake are likely confounders alongside the usual, age, sex race and socioeconomic status. That said, i don’t know if the last of these two are available given that the variable names in the dataset are not particularly descriptive.

  • Are there important confounders that are not measured?

See above - race and socioeconomic status stand out as two confounding variables that don’t appear to be measured in this dataset.

  • Fit one or more suitable regression models and summarize the output.

I’ll fit the same model as model2 in the text since Lumley explains what the variables are in that model.

fit <- svyglm(BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + BMXBMI,
                 design = des)
summary(fit)
## 
## Call:
## svyglm(formula = BPXSAR ~ sodium + potassium + RIDAGEYR + RIAGENDR + 
##     BMXBMI, design = des)
## 
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.32903    1.42668  68.220  < 2e-16 ***
## sodium       0.43458    0.16164   2.689   0.0126 *  
## potassium   -0.96119    0.17043  -5.640 7.19e-06 ***
## RIDAGEYR     0.45791    0.01080  42.380  < 2e-16 ***
## RIAGENDR    -3.38208    0.38403  -8.807 3.90e-09 ***
## BMXBMI       0.38460    0.03797  10.129 2.48e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 263.5461)
## 
## Number of Fisher Scoring iterations: 2

According to this model there is a .38 mm Hg expected increase in systolic blood pressure for every one unit increase in BMI after adjusting for age, sex, and daily sodium and potassium intake. In other words we’d expect a higher blood pressure amongst those with higher BMIs.

  • Examine whether there is an interaction with age or sex.
fit <- svyglm(BPXSAR ~ (RIDAGEYR + RIAGENDR)*I(BMXBMI - 25) + sodium + potassium,
              design = des)
summary(fit)
## 
## Call:
## svyglm(formula = BPXSAR ~ (RIDAGEYR + RIAGENDR) * I(BMXBMI - 
##     25) + sodium + potassium, design = des)
## 
## Survey design:
## update(des, sodium = DR1TSODI/1000, potassium = DR1TPOTA/1000)
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             107.178639   1.200310  89.292  < 2e-16 ***
## RIDAGEYR                  0.462853   0.011268  41.076  < 2e-16 ***
## RIAGENDR                 -3.485444   0.353566  -9.858 1.00e-09 ***
## I(BMXBMI - 25)            0.510054   0.104374   4.887 6.18e-05 ***
## sodium                    0.438810   0.163223   2.688 0.013119 *  
## potassium                -0.989246   0.169362  -5.841 5.95e-06 ***
## RIDAGEYR:I(BMXBMI - 25)  -0.005031   0.001305  -3.855 0.000806 ***
## RIAGENDR:I(BMXBMI - 25)   0.041993   0.059821   0.702 0.489736    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 263.1093)
## 
## Number of Fisher Scoring iterations: 2

I center the BMI variable at 25 (roughly the marginal average) and fit the model with the age and sex interactions with the centered BMI. The model fit shows that there’s a significant negative association with the BMI-age interaction and no association with the gender/sex - age interaction. This corresponds to a amplifying effect of age — the older you are the lower your blood pressure is, on average, a higher BMI will then also lead to a lower expected blood pressure in addition to this affect.

  1. Prove that an unweighted regression estimator is approximately unbiased when the weights depend only on variables in the model. Specifically, if the true population regression coefficients \(\beta^*\) satisfy:

\[ \sum_{i=1}^{N} x_i(y_i - x_i\beta^*) = 0 \]

and \(R_i\) indicates that observation \(i\) is in the sample prove that

\[ E \left [ \sum_{i=1}^{N} R_ix_i (y_i - x_i \beta^*)\right ] = 0 \]

so that the unweighted sample estimating equations are unbiased.

5.8 A rough approximation to the loss of efficiency from unnecessarily using weights can be constructed by considering the variance of the residuals in weighted and unweighted estimation. Assume as an approximation that the residuals \(r_i\) are independent of the sampling weights \(w_i\)

  • Show that \[ V[\sum_{i=1}^{n} w_ir_i] = E[w^2] V[r] + V[w]E[r^2] \]

  • Now assume that the mean of the residuals is zero, and show that the relative efficiency of the unweighted estimate is \(1 + cv(w)\) where \(cv\) is the coefficient of variation, the ratio of the standard deviation to the mean.

Chapter 6: Categorical Data Regression

Chapter 7: Post-Stratification, Raking, and Calibration

Introduction - Motivation

Lumley motivates the need to explore the three titular topics by expanding on the principle developed in the second chapter — stratification. Similar as to how making use of the extra information available in strata we can improve estimates in straightforward estimation of totals and means, Lumley’s focus in this chapter is how to use the “auxiliary” information to adjust for non-response bias and improve the precision of the estimates.

Post-Stratification

Post-stratification is exactly what it sounds like - re-weighting estimates according to strata totals after or apart from any initial strata that might have been involved in the inital sampling design.

Consider a relatively straightforward design in which there’s a population of subjects of size \(N\) that can be partitioned into \(K\) mutually exclusive strata from which any of the \(N_k\) individuals in that strata can be sampled for \(n_k\) strata samples. In this setting the sampling weights for each individual in group \(k\) is \(\frac{N_k}{n_k}\) and \(N_k\) is known without any uncertainty.

If the sampling were not stratified but \(N_k\) were still known, the group sizes would not be exactly correct by simple Horvitz-Thompson estimation, but they could be corrected by re-weighting so that the sizes are correct as they would be in stratified sampling.

Specifically, take each weight \(w_i = \frac{1}{\pi_i}\) and construct new weights \(w_i^* = \frac{g_i}{\pi_i} = \frac{N_k}{\hat{N}_k} \times \frac{1}{\pi_i}\).

For estimating the group side of the kth group then, we’ll have

\[ n_k \times \frac{g_i}{\pi_i} = n_k \times \frac{1}{\pi_i} \times \frac{N_k}{\hat{N}_k} = n_k \times \frac{\hat{N}_k}{n_k} \times \frac{N_k}{\hat{N}_k} = N_k, \] where \(\pi_i = \frac{n_k}{\hat{N}_k}\). The consequence of this re-weighting means that the estimated sub group population is exactly correct and subsequent estimates within or across these groups benefit from the extra information.

Of course, as Lumley notes, there’s a problem if no entities were sampled in the particular strata of interest - you can’t re-weight the number 0. Still since this is unlikely to happen for groups and samples of “reasonable” size post-stratification is still a worthy strategy given the potential reductions in variance that are possible.

Illustration

Lumley’s illustration of post-stratification looks at the two-stage sample drawn from the API population, with 40 school districts sampled from California and then up to 5 schools sampled from each district. Lumley uses this example to illustrate how improvements to precision can be made via post-stratification – or not.

We’ll start with a reminder of the sample design used here: a two-stage sample.

clus2_design
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)

Then information about the population group sizes is included in the call to postStratify() as well as the variable/strata across which to post-stratify.

pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
ps_design <- postStratify(clus2_design, strata = ~stype, population = pop.types)
ps_design
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## postStratify(clus2_design, strata = ~stype, population = pop.types)

Totals, and so on are then estimated in the usual fashion. In this example there’s a large difference in the variability of the estimated total when comparing the naive and post-stratified estimates because much of the variability in the number of students enrolled can be explained by the school type - elementary schools are typically smaller than middle schools and highschools. By including more information about the types of schools in the overall population, the standard error is decreased by a factor of ~ 2.6.

svytotal(~enroll, clus2_design, na.rm = TRUE)
##          total     SE
## enroll 2639273 799638
svytotal(~enroll, ps_design, na.rm = TRUE)
##          total     SE
## enroll 3074076 292584

In contrast, school type is not associated with the variability in the school scores measured by the Academic Performance Index - denoted below by api00.

svymean(~api00, clus2_design)
##         mean     SE
## api00 670.81 30.099
svymean(~api00, ps_design)
##       mean     SE
## api00  673 28.832

Indeed, the score is specifically setup to be standardized across school types and as such there’s little variance reduction observed by using the post-stratification information in this instance.

Lumley notes that if the api dataset were a real survey, non response might vary as a function of school type and in which case post-stratification could help reduce non-response bias.

Raking

If one were to post-stratify using more than one variable would require the complete joint distribution of both variables. This can be problematic either because the data for the joint distribution - or cross classification as Lumley calls it - is not available, or because one of the cross - strata have no observations in the sample.

Chapter 8: Two-Phase Sampling

MultiStage and Multiphase Sampling

In comparison to multistage sampling, where individuals or clusters were sampled independently of one another across stages, multiphase sampling is a design where individuals are sampled dependent upon information obtained in the first sample. Consequently, instead of having two sampling probabilities which are multiplied by each other \(\pi_1 \times \pi_2\) we have conditional weights \(\pi_1\) and \(\pi_{2|1}\) which describe the probability of an entity being sampled in phase one and the the probability of a phase being sampled in stage two, conditional on being sampled in phase one.

Chapter 9: Missing Data

Chapter 10: Causal Inference

Session Info

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## Random number generation:
##  RNG:     Mersenne-Twister 
##  Normal:  Inversion 
##  Sample:  Rounding 
##  
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Denver
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] addhazard_1.1.0   sf_1.0-16         quantreg_5.98     SparseM_1.84-2   
##  [5] lattice_0.22-6    DiagrammeR_1.0.11 RSQLite_2.3.6     patchwork_1.2.0  
##  [9] survey_4.2-1      survival_3.5-7    Matrix_1.6-1.1    srvyr_1.2.0      
## [13] lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
## [17] purrr_1.0.2       readr_2.1.4       tidyr_1.3.0       tibble_3.2.1     
## [21] ggplot2_3.4.4     tidyverse_2.0.0   gt_0.10.0         haven_2.5.3      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0   farver_2.1.1       blob_1.2.4         fastmap_1.1.1     
##  [5] digest_0.6.33      timechange_0.2.0   lifecycle_1.0.4    ellipsis_0.3.2    
##  [9] magrittr_2.0.3     compiler_4.3.2     rlang_1.1.2        sass_0.4.7        
## [13] tools_4.3.2        utf8_1.2.4         yaml_2.3.7         knitr_1.45        
## [17] ahaz_1.15          labeling_0.4.3     htmlwidgets_1.6.3  bit_4.0.5         
## [21] classInt_0.4-10    xml2_1.3.5         RColorBrewer_1.1-3 KernSmooth_2.23-22
## [25] withr_2.5.2        fansi_1.0.5        e1071_1.7-14       colorspace_2.1-0  
## [29] scales_1.3.0       MASS_7.3-60        cli_3.6.1          rmarkdown_2.25    
## [33] generics_0.1.3     rstudioapi_0.15.0  tzdb_0.4.0         sampling_2.10     
## [37] visNetwork_2.1.2   readxl_1.4.3       DBI_1.2.2          cachem_1.0.8      
## [41] proxy_0.4-27       splines_4.3.2      cellranger_1.1.0   mitools_2.4       
## [45] vctrs_0.6.4        jsonlite_1.8.7     hms_1.1.3          bit64_4.0.5       
## [49] hexbin_1.28.3      jquerylib_0.1.4    units_0.8-5        glue_1.6.2        
## [53] stringi_1.8.2      gtable_0.3.4       munsell_0.5.0      pillar_1.9.0      
## [57] htmltools_0.5.7    dbplyr_2.4.0       R6_2.5.1           evaluate_0.23     
## [61] lpSolve_5.6.20     highr_0.10         memoise_2.0.1      bslib_0.6.1       
## [65] class_7.3-22       MatrixModels_0.5-3 Rcpp_1.0.11        xfun_0.43         
## [69] pkgconfig_2.0.3

References

DuMouchel, William H, and Greg J Duncan. 1983. “Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples.” Journal of the American Statistical Association 78 (383): 535–43.
Hanif, Muhammad, and KRW Brewer. 1980. “Sampling with Unequal Probabilities Without Replacement: A Review.” International Statistical Review/Revue Internationale de Statistique, 317–35.
Judkins, David R. 1990. “Fay’s Method for Variance Estimation.” Journal of Official Statistics 6 (3): 223–39.
Kish, Leslie. 1965. Survey Sampling. New York: Chichester Wiley.
Lumley, Thomas. 2011. Complex Surveys: A Guide to Analysis Using r. Vol. 565. John Wiley & Sons.
McCarthy, Philip J. 1966. “Replication, an Approach to the Analysis of Data from Complex Surveys.”
Meng, Xiao-Li. 2018. “STATISTICAL PARADISES AND PARADOXES IN BIG DATA (i) LAW OF LARGE POPULATIONS, BIG DATA PARADOX, AND THE 2016 US PRESIDENTIAL ELECTION.” The Annals of Applied Statistics 12 (2): 685–726.
Tillé, Yves. 2006. Sampling Algorithms. Springer.